Python: need to create a new column when merging multiple CSV files

Thanks for the help in advance; this is a multi-part question.
I have zip files containing pricing info for multiple stocks. The current format is:
Header row:
ticker,date,open,high,low,close,vol
and an example first row is
AAPL,201906030900,176.32,176.32,176.24,176.29,2247
Desired format:
Header row:
ticker,date,time,open,high,low,close,vol
and an example data row:
AAPL,20190603,09:00,176.32,176.32,176.24,176.29,2247
where a time column is added and filled with the last 4 digits of the date value, with a colon in the middle, and those last 4 digits are removed from the date column.
There are about 400 rows of data for each stock in each file, so every row needs to be converted.
I haven't been able to find an answer here or elsewhere on the web that I could understand well enough to accomplish what I am trying to do.

Try the following, using pandas:
data.csv
ticker,date,open,high,low,close,vol
AAPL,201906030900,176.32,176.32,176.24,176.29,2247
ABCD,202002211000,220.97,217.38,221.43,219.82,8544
code
import pandas as pd

df = pd.read_csv('data.csv')
# print(df)

# Build the time column (HH:MM) from the last 4 digits of the date,
# then strip those digits from the date itself
df['time'] = df['date'].apply(lambda x: f'{str(x)[-4:-2]}:{str(x)[-2:]}')
df['date'] = df['date'].apply(lambda x: str(x)[:-4])

# Reorder the columns so time sits right after date
cols = df.columns.to_list()
cols = cols[:2] + cols[-1:] + cols[2:-1]
df = df[cols]
# print(df)

df.to_csv('out.csv', index=False)
out.csv
ticker,date,time,open,high,low,close,vol
AAPL,20190603,09:00,176.32,176.32,176.24,176.29,2247
ABCD,20200221,10:00,220.97,217.38,221.43,219.82,8544
You can use the same code to loop over multiple files.
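For instance, here is a minimal sketch of that loop, assuming the CSVs have already been extracted from the zip archives into a folder named 'extracted/' (the folder name and output file naming are illustrative assumptions, not part of the original answer):
import glob
import pandas as pd

# Assumed layout: all extracted CSVs live in 'extracted/'
for path in glob.glob('extracted/*.csv'):
    df = pd.read_csv(path)

    # Split HHMM off the end of the 12-digit date into an HH:MM time column
    df['time'] = df['date'].apply(lambda x: f'{str(x)[-4:-2]}:{str(x)[-2:]}')
    df['date'] = df['date'].apply(lambda x: str(x)[:-4])

    # Reorder so time sits right after date
    cols = df.columns.to_list()
    cols = cols[:2] + cols[-1:] + cols[2:-1]
    df = df[cols]

    df.to_csv(path.replace('.csv', '_out.csv'), index=False)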

Related

python pandas: how to modify column header name and modify the date format

Using Python pandas, how can we change the data frame?
First, how to copy the column name down to the other cell (blue)?
Second, how to delete the row and index column (orange)?
Third, how to modify the date format (green)?
I would appreciate any feedback.
Update
df.iloc[1,1] = df.columns[0]
df = df.iloc[1:].reset_index(drop=True)
df.columns = df.iloc[0]
df = df.drop(df.index[0])
df = df.set_index('Date')
print(df.columns)
Question 1 - How to copy a column name to a column (edit: rename a column)
To rename a column, use pandas.DataFrame.rename:
df.columns = ['Date','Asia Pacific Equity Fund']
# Here the list size should be 2 because you have 2 columns
# Alternatively, rename using pandas.DataFrame.rename
df.rename(columns = {'Asia Pacific Equity Fund':'Date',"Unnamed: 1":"Asia Pacific Equity Fund"}, inplace = True)
df.columns returns all the columns of the dataframe, where you can access each column name by index.
Please refer to 'Rename unnamed column pandas dataframe' for changing unnamed columns.
Question 2 - Delete a row
# Keep rows from the first row onward and renumber the index
df = df.iloc[1:].reset_index(drop=True)
# To remove specific rows by label
df = df.drop([0, 1]).reset_index(drop=True)
Question 3 - Modify the date format
current_format = '%Y-%m-%d %H:%M:%S'
desired_format = '%Y-%m-%d'
# Let pandas parse the dates, then re-format them
df['Date'] = pd.to_datetime(df['Date']).dt.strftime(desired_format)
# Or pass the existing format explicitly
df['Date'] = pd.to_datetime(df['Date'], format=current_format).dt.strftime(desired_format)
# To update the date format of the index
df.index = pd.to_datetime(df.index, format=current_format).strftime(desired_format)
Please refer to pandas.to_datetime for more details.
I'm not sure I understand your questions. I mean, do you actually want to change the dataframe, or just how it is printed/displayed?
Indexes can be changed using the .set_index() or .reset_index() methods, or dropped if needed. If you just want to remove the first digit from each index (that's what I understood from the orange column), you can create a list with the new indexes and assign it to your dataframe.
Regarding the date format, it depends on what you want the changed format to become. Take a look into Python's datetime.
I would strongly suggest you take a better look at pandas' features and documentation and at how to handle a dataframe with this library. There are plenty of great sources a Google search away :)
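As a small illustration of the index handling mentioned above (the frame here is a made-up example, not the asker's data):
import pandas as pd

df = pd.DataFrame({'Date': ['2020-01-01', '2020-01-02'], 'Value': [1.0, 2.0]})

# Use the Date column as the index, then move it back to a regular column
df = df.set_index('Date')
df = df.reset_index()

# Or assign a brand-new index from a list
df.index = [10, 20]
print(df)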
Delete the first two rows, rename the second column, and work with the datetime format using the datetime package.
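For example, a minimal sketch with the plain datetime module (the input string and formats are assumptions for illustration):
from datetime import datetime

# Parse a timestamp string, then print only the date part
ts = datetime.strptime('2019-06-03 09:00:00', '%Y-%m-%d %H:%M:%S')
print(ts.strftime('%Y-%m-%d'))  # 2019-06-03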

Dropping index in DataFrame for CSV file

I'm working with a CSV file in PyCharm. I want to delete the automatically generated index column. When I print it, however, the answer I get in the terminal is "None". All the answers by other users indicate that the reset_index method should work.
If I just say "df = df.reset_index(drop=True)" it does not delete the column, either.
import pandas as pd
df = pd.read_csv("music.csv")
df['id'] = df.index + 1
cols = list(df.columns.values)
df = df[[cols[-1]]+cols[:3]]
df = df.reset_index(drop=True, inplace=True)
print(df)
I agree with @It_is_Chris. Also:
This does not work, because the return value of reset_index with inplace=True is None:
df = df.reset_index(drop=True, inplace=True)
It should be like this:
df.reset_index(drop=True, inplace=True)
or
df = df.reset_index(drop=True)
Since you said you're trying to "delete the automatically-generated index column" I could think of two solutions!
First solution:
Use the dataset's own index column as the dataframe index. If your dataset has already been indexed/numbered, you could do something like this:
# assuming the first column in the dataset is your index column (column number zero)
df = pd.read_csv("yourfile.csv", index_col=0)
#you won't see the automatically-generated index column anymore
df.head()
Second solution:
You could drop it when writing the final CSV:
# Export your df to a CSV without the automatically generated index column
df.to_csv("yourfile.csv", index=False)

Removing rows of duplicate headers or repeated column strings and blank lines in pandas in Python

I have a sample data file (Data_sample_truncated.txt) which I truncated from a larger data set. It has 3 fields - "Index", "Time" and "RxIn.Density[x, ::]", where x is an integer that can vary over any range; in this data it is 0-15. The combination of the 3 column fields is unique. For different "Index" values the "Time" and "RxIn.Density[x, ::]" values can be the same or different. For each new "Index" value the data has a blank line and almost identical column headers, except that in "RxIn.Density[x, ::]" x increases when a new "Index" value is reached. The data I export from ADS (circuit simulation software) comes in this format.
Now I want to format the data so that everything is merged under 3 unique column fields - "Index", "Time" and "RxIn.Density". As you can see, I want to remove the [x, ::] strings from the 3rd column in the new dataframe. Here is the sample final data file that I want after formatting (Data-format_I_want_after_formatting.txt). So I want the following:
The blank lines (or rows) to be removed
All the other header lines to be removed, keeping only the top header and changing the 3rd column header to "RxIn.Density"
Keeping all the data merged under the unique column fields - "Index", "Time" and "RxIn.Density", even if the data values are duplicate.
My Python code is below:
import pandas as pd
#create DataFrame from csv with columns f and v
df = pd.read_csv('Data_sample_truncated.txt', sep="\s+", names=['index','time','v'])
#boolean mask for identify columns of new df
m = df['v'].str.contains('RxIn')
#new column by replace NaNs by forward filling
df['g'] = df['v'].where(m).ffill()
#get original ordering for new columns
#cols = df['g'].unique()
#remove rows with same values in v and g columns
#df = df[df['v'] != df['g']]
df = df.drop_duplicates(subset=['index', 'time'], keep=False)
df.to_csv('target.txt', index=False, sep='\t')
The generated target.txt file is not what I wanted. Can anyone help me figure out what is wrong with my code and how to fix it so that I get my intended formatting?
I am using Spyder 3.2.6 (Anaconda) with Python 3.6.4 64-bit.
You can just filter out the rows that you do not want:
import pandas as pd

df = pd.read_csv('Data_sample_truncated.txt', sep=r'\s+')
# Rename the columns and drop the extra fourth column created when the bracketed header splits
df.columns = ['index', 'time', 'RxIn.Density', '1']
del df['1']
# Drop the repeated header rows, which contain the string 'Rx' in the data column
df = df[~df['RxIn.Density'].str.contains('Rx')].reset_index(drop=True)
df.to_csv('target.txt', index=False, sep='\t')
Try this:
# The extra 'mask' column only gets a value on the repeated header rows,
# so rows where it is NaN are the real data rows
df = pd.read_csv('Data_sample_truncated.txt', sep=r'\s+', names=['index', 'time', 'RxIn.Density', 'mask'], header=None)
df = df[df['mask'].isna()].drop(['mask'], axis=1)
df.to_csv('target.txt', index=False, sep='\t')

Pandas drop method does not work.

I want to filter a rather large Pandas dataframe (about 3 million rows) by date.
For some reason the drop method when used with boolean criteria does not work at all. It just returns the same old dataframe. Dropping single rows is no problem though.
This is the code I used initially, which essentially does nothing at all:
import pandas as pd
#open the file
df = pd.read_csv('examplepath/examplefile.csv', names=['File Name','FileSize','File Type','Date Created','Date Last Accessed','Date Last Modified','Path'],\
delimiter=';', header=None, encoding="ISO-8859-1",)
#convert to german style date
df['Date Created'] = pd.to_datetime(df['Date Created'], dayfirst=True)
#drop rows and assign new dataframe
df_filtered = df.drop(df[df['Date Created'] > datetime(2010,1,1)])
I then came up with this code, which seemingly works like a charm:
import pandas as pd
#open the file
df = pd.read_csv('examplepath/examplefile.csv', names=['File Name','FileSize','File Type','Date Created','Date Last Accessed','Date Last Modified','Path'],\
delimiter=';', header=None, encoding="ISO-8859-1",)
#convert to german style date
df['Date Created'] = pd.to_datetime(df['Date Created'], dayfirst=True)
#select rows and assign new dataframe
df_filtered = df[df['Date Created'] < datetime(2010,1,1)]
Both codes in theory should do the same thing, right?
Is one of them preferable? Can I just work with my second version? In the future I may have to add a second filter date.
I hope someone can help me.
Thanks and best regards,
Stefan
You've got to give drop a list of index labels or column names to drop rows or columns, respectively.
Read the docs and the examples given there.
Your second approach works because that is the way you filter a dataframe.
You may use it at will.
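If you still want drop, it also works once you hand it the index labels of the rows to remove rather than a boolean mask. A small sketch on made-up data (the column names are borrowed from the question):
import pandas as pd
from datetime import datetime

df = pd.DataFrame({
    'File Name': ['a.txt', 'b.txt', 'c.txt'],
    'Date Created': pd.to_datetime(['2008-05-01', '2011-03-15', '2012-07-30']),
})
cutoff = datetime(2010, 1, 1)

# Boolean filtering, as recommended above
kept = df[df['Date Created'] < cutoff]

# drop() needs index labels, not a boolean mask
dropped = df.drop(df[df['Date Created'] >= cutoff].index)

print(kept)
print(dropped)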

iterate through csv columns to create multiple python dataframes

I am trying to create multiple data frames using the columns of an Excel CSV file. This is as far as I have been able to get:
import pandas as pd
file = pd.read_csv('file.csv')
df = pd.DataFrame(file)
cols = df.columns
# column names are 'Date', 'Stock 1', 'Stock 2', etc - I have 1000 columns
for i in range(len(cols)):
    df[i] = df[['Date', b(i)]]
So the end result I want is multiple dataframes: the first dataframe with columns 1 and 2 (so Date and Stock 1), the second with columns 1 and 3 (so Date and Stock 2), the third with columns 1 and 4, and so on, creating new dataframes all the way to columns 1 and 1000.
I have tried several ways and either get "index is not callable" or, when I tried usecols, "usecols must be strings or integers".
Can anyone help me with this? Conceptually it is easy, but I cannot get the code right. Thank you.
This does what you are asking:
all_dfs = []
for col in df.columns:
    if col != 'Date':
        df_current = df[['Date', col]]
        all_dfs.append(df_current)
Or as one line:
all_dfs = [df[['Date', col]] for col in df.columns if col != 'Date']
But you probably don't want to do that. There's not much point. What are you really trying to do?
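If the goal is just to look up the per-stock frames later, a dictionary keyed by column name may be more convenient than a list. A minimal sketch on a made-up frame (the 'Stock 1'/'Stock 2' names are taken from the question's description):
import pandas as pd

# Made-up frame standing in for the question's 1000-column file
df = pd.DataFrame({
    'Date': ['2020-01-01', '2020-01-02'],
    'Stock 1': [10.0, 10.5],
    'Stock 2': [20.0, 19.5],
})

# One two-column frame per stock, keyed by the stock column's name
dfs_by_stock = {col: df[['Date', col]] for col in df.columns if col != 'Date'}

print(dfs_by_stock['Stock 1'])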
