Iterate through CSV columns to create multiple Python dataframes - python

I am trying to create multiple data frames using the columns of an Excel CSV file. This is where I have been able to get to:
import pandas as pd
file = pd.read_csv('file.csv')
df = pd.DataFrame(file)
cols = df.columns
# column names are 'Date', 'Stock 1', 'Stock 2', etc. - I have 1000 columns
for i in range(len(cols)):
    df[i] = df[['Date', b(i)]]
So the end result I want is multiple dataframes: the first with columns 1 and 2 (Date and Stock 1), the second with columns 1 and 3 (Date and Stock 2), the third with columns 1 and 4, and so on all the way to columns 1 and 1000.
I have tried several ways and either get an "index is not callable" error, or, trying usecols, "usecols must be strings or integers".
Can anyone help me with this? Conceptually it is easy, but I cannot get the code right. Thank you.

This does what you are asking:
all_dfs = []
for col in df.columns:
    if col != 'Date':
        df_current = df[['Date', col]]
        all_dfs.append(df_current)
Or as one line:
all_dfs = [df[['Date', col]] for col in df.columns if col != 'Date']
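If you need to look the frames up by stock name later, a dict may be handier than a list; a minimal sketch under the same assumptions (a 'Date' column plus one column per stock):
# keyed by the stock column name, e.g. dfs_by_stock['Stock 1']
dfs_by_stock = {col: df[['Date', col]] for col in df.columns if col != 'Date'}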
But you probably don't want to do that. There's not much point. What are you really trying to do?

Related

Best way to move an unexpected column in a Pandas DF to a new DF?

Wondering what the best way to tackle this issue is. If I have a DF with the following columns
df1
type_of_fruit  name_of_fruit  price
.....          .....          .....
and a list called
expected_cols = ['name_of_fruit','price']
what's the best way to automate the check of df1 against the expected_cols list? I was trying something like
df_cols = df1.columns.values.tolist()
if df_cols != expected_cols:
And then try to drop to another df any columns not in expected_cols, but this doesn't seem like a great idea to me. Is there a way to save the "dropped" columns?
df2 = df1.drop(columns=expected_cols)
But then this seems problematic depending on column ordering, and also in cases where the columns could have either more values than expected, or fewer. In cases where there are fewer values than expected (i.e. df1 only contains the column name_of_fruit) I'm planning on using
df1.reindex(columns=expected_cols)
But I'm a bit iffy on how to do this programmatically, and then on how to handle the case where there are more columns than expected.
You can use set difference with -:
Assuming df1 has these columns:
df1_cols = df1.columns  # ['type_of_fruit', 'name_of_fruit', 'price']
expected_cols = ['name_of_fruit', 'price']
unwanted_cols = list(set(df1_cols) - set(expected_cols))
df2 = df1[unwanted_cols]                       # the "dropped" columns, saved
df1.drop(columns=unwanted_cols, inplace=True)
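One caveat: set difference doesn't preserve column order, so unwanted_cols may come out shuffled. If order matters, a small order-preserving sketch:
# iterate df1.columns so the original order is kept
unwanted_cols = [c for c in df1.columns if c not in expected_cols]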
Use groupby along the columns axis to split the DataFrame succinctly. Here, check whether each column is in your list to form the grouper, and store the results in a dict: the True key gets the DataFrame with the subset of columns in the list, and the False key gets the subset not in the list.
Sample Data
import pandas as pd
df = pd.DataFrame(data=[[1, 2, 3]],
                  columns=['type_of_fruit', 'name_of_fruit', 'price'])
expected_cols = ['name_of_fruit','price']
Code
d = dict(tuple(df.groupby(df.columns.isin(expected_cols), axis=1)))
# If you need to ensure columns are always there then do
#d[True] = d[True].reindex(expected_cols)
d[True]
#    name_of_fruit  price
# 0              2      3
d[False]
#    type_of_fruit
# 0              1
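Note that if every column is in expected_cols there is no False group, so d[False] raises a KeyError; d.get(False) returns None instead:
unexpected = d.get(False)  # None when all columns were expected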

List of Dataframes, drop Dataframe column (columns have different names) if row contains a special string

What I have is a list of dataframes.
What is important to note is that the shape of the dataframes differs between 2 and 7 columns; also, the columns are named 0 through len(columns)-1 (e.g. df1 has 5 columns named 0,1,2,3,4; df2 has 4 columns named 0,1,2,3).
What I would like is to check whether a row in a column contains a certain string, and then delete that column.
list_dfs1=[df1,df2,df3...df100]
What I have done so far is the below, and I get an error that column 5 is not in axis (it is there for some DFs):
for i, df in enumerate(list_dfs1):
    for index, row in df.iterrows():
        if np.where(row.str.contains("DEC")):
            df.drop(index, axis=1)
Any suggestions?
You could try:
for df in list_dfs1:
    for col in df.columns:
        # If you are unsure about column types, cast the column as string:
        df[col] = df[col].astype(str)
        # Check if the column contains the string of interest
        if df[col].str.contains("DEC").any():
            df.drop(columns=[col], inplace=True)
If you know that all columns are of type string, you don't have to actually do df[col] = df[col].astype(str).
You can write a custom function that checks whether a column contains the pattern, using pd.Series.str.contains with pd.Series.any:
def func(s):
    return s.str.contains('DEC').any()

list_df = [df.loc[:, ~df.apply(func)] for df in list_dfs1]
I would take another approach: concatenate the list into one data frame and then eliminate any column where the string is found.
import pandas as pd
df = pd.concat(list_dfs1)
Let us say your condition is to eliminate any column containing "DEC":
df.mask(df == "DEC").dropna(axis=1, how="any")
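Note that df == "DEC" tests for cells exactly equal to "DEC", while the question used the substring match str.contains("DEC"). A substring variant of this concat approach might look like (a sketch, casting to string first in case of mixed dtypes):
# True per column if any cell contains "DEC" as a substring
has_dec = df.apply(lambda s: s.astype(str).str.contains("DEC")).any()
df = df.loc[:, ~has_dec]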

Find the difference between data frames based on specific columns and output the entire record

I want to compare 2 CSVs (A and B) and find the rows which are present in B but not in A, based only on specific columns.
I found a few answers, but they still don't give the result I expect.
Answer 1:
df = new[~new['column1', 'column2'].isin(old['column1', 'column2'].values)]
This doesn't work. It works for a single column but not for multiple columns.
Answer 2 :
df = pd.concat([old, new])              # concat dataframes
df = df.reset_index(drop=True)          # reset the index
df_gpby = df.groupby(list(df.columns))  # group by
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]  # reindex
final = df.reindex(idx)
This takes as an input specific columns and also outputs specific columns. I want to print the whole record and not only the specific columns of the record.
I tried this and it gave me the rows:
import pandas as pd
columns = [{Name of columns you want to use}]
new = pd.merge(A, B, how = 'right', on = columns)
col = new['{Any column from the first DataFrame which isn't in the list columns. You will probably have to add an '_x' at the end of the column name}']
col = col.dropna()
new = new[~new['{Any column from the first DataFrame which isn't in the list columns. You will probably have to add an '_x' at the end of the column name}'].isin(col)]
This will give you the rows based on the columns list. Sorry for the bad naming. If you want to rename the columns a bit too, here's the code for that:
for column in new.columns:
    if '_x' in column:
        new = new.drop(column, axis=1)
    elif '_y' in column:
        new = new.rename(columns={column: column[:column.find('_y')]})
Tell me if it works.
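For reference, pandas' merge indicator expresses "in B but not in A" more directly while still returning whole records; a minimal sketch, assuming A and B are the two dataframes and columns is the list of key columns:
# indicator=True adds a '_merge' column marking each row's origin
merged = B.merge(A[columns].drop_duplicates(), on=columns, how='left', indicator=True)
only_in_B = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
Merging against only A's key columns also avoids the _x/_y suffixes entirely.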

Python: need to create a new column when merging multiple csv files

Thanks for help in advance. Multi-part question.
I have zip files with multiple stocks' pricing info. The current format is:
Header row:
ticker,date,open,high,low,close,vol
and an example first row is
AAPL,201906030900,176.32,176.32,176.24,176.29,2247
Desired format:
header
ticker,date,time,open,high,low,close,vol
and data
AAPL,20190603,09:00,176.32,176.32,176.24,176.29,2247
where a time column is added and filled with the last 4 digits of the date value with a colon in the middle, and those last 4 digits are removed from the date column.
There are about 400 rows of data for each stock in each file, so each row would need to be converted.
I haven't been able to find an answer here or elsewhere on the web that I could understand to accomplish what I am trying to do.
Try the following, using pandas:
data.csv
ticker,date,open,high,low,close,vol
AAPL,201906030900,176.32,176.32,176.24,176.29,2247
ABCD,202002211000,220.97,217.38,221.43,219.82,8544
code
import pandas as pd
df = pd.read_csv('data.csv')
# print(df)
df['time'] = df['date'].apply(lambda x: f'{str(x)[-4:-2]}:{str(x)[-2:]}')
df['date'] = df['date'].apply(lambda x: str(x)[:-4])
cols = df.columns.to_list()
cols = cols[:2] + cols[-1:] + cols[2:-1]
df = df[cols]
# print(df)
df.to_csv('out.csv', index=False)
out.csv
ticker,date,time,open,high,low,close,vol
AAPL,20190603,09:00,176.32,176.32,176.24,176.29,2247
ABCD,20200221,10:00,220.97,217.38,221.43,219.82,8544
You can use the same code to loop over multiple files.
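A sketch of that loop, assuming the CSVs have been extracted into the current directory (the glob pattern and output naming are placeholders):
import glob
import pandas as pd

for path in glob.glob('*.csv'):  # hypothetical pattern; adjust to your files
    df = pd.read_csv(path)
    df['time'] = df['date'].apply(lambda x: f'{str(x)[-4:-2]}:{str(x)[-2:]}')
    df['date'] = df['date'].apply(lambda x: str(x)[:-4])
    cols = df.columns.to_list()
    df = df[cols[:2] + cols[-1:] + cols[2:-1]]
    df.to_csv(f'converted_{path}', index=False)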

Is there a way to rename multiple df columns in Python?

I'm trying to rename multiple columns in a dataframe to certain dates with Python.
Currently, the columns are as such: 2016-04, 2016-05, 2016-06....
I would like the columns to read: April2016, May2016, June2016...
There are around 40 columns. I am guessing a for loop would be the most efficient way to do this, but I'm relatively new to Python and not sure how to concatenate the column names correctly.
You can use loops or comprehensions along with a month dictionary to split, reorder, and replace the string column names:
# in your case this would be cols = df.columns
cols = ['2016-04', '2016-05', '2016-06']
rpl = {'04': 'April', '05': 'May', '06': 'June'}
cols = [''.join([rpl[i.split('-')[1]], i.split('-')[0]]) for i in cols]
cols
# ['April2016', 'May2016', 'June2016']
# then you would assign it back with df.columns = cols
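With around 40 columns spanning many months, hand-maintaining the month dictionary gets tedious; pandas can parse the labels itself. A sketch, assuming every column label follows the YYYY-MM pattern:
import pandas as pd

# '%B' is the full month name, so '2016-04' -> 'April2016'
df.columns = [pd.to_datetime(c, format='%Y-%m').strftime('%B%Y') for c in df.columns]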
You didn't share your dataframe, so I used a basic one to explain how to get the month from a given date. I assumed your dataframe looks like:
d = {'dates': ['2016-04', '2016-05','2016-06']} #just 3 of them
So the full code:
import datetime
import pandas as pd

d = {'dates': ['2016-04', '2016-05', '2016-06']}
df = pd.DataFrame(d)
for index, row in df.iterrows():
    get_date = row['dates'].split('-')
    get_month = get_date[1]
    month = datetime.date(1900, int(get_month), 1).strftime('%B')
    print(month + get_date[0])
OUTPUT:
April2016
May2016
June2016
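To actually rename the columns rather than print the names, the same logic can be assigned back; a sketch assuming a dataframe whose column labels look like '2016-04', as in the original question:
df.columns = [datetime.date(1900, int(c.split('-')[1]), 1).strftime('%B') + c.split('-')[0]
              for c in df.columns]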
