I have some csv files and I want to copy a specific column from all of them and save the columns side by side in a new csv file. But the following code appends them into a single column.
Also, in total I have to go through almost 20M data points, so I don't want to hold everything in a single dataframe and save it at the end.
Here is my code:
import os
import glob
import pandas as pd

k = glob.glob("*.csv")
colu = "Close"
file = "merged.csv"
temp_dirr = "./temp/"
if not os.path.exists(temp_dirr):
    os.makedirs(temp_dirr)
filename = temp_dirr + file
df = pd.read_csv(k[0])[colu].dropna()
df.to_csv(filename, header=False, index=False)
for i in k[1:]:
    df = pd.read_csv(i)[colu].dropna()
    df.to_csv(filename, mode="a", header=False, index=False)
and here is the output merged.csv file:
23.6
1065
23.45
1150
172.7
11098
11443.3
But I want the output file to be like this:
23.6 172.7
1065 11098
23.45 11443.3
1150
Here the folder has 2 csv files, and the two columns are the "Close" columns of those 2 files. So how do I add them column-wise?
You can do it this way:
def get_merged_csv(flist, **kwargs):
    return pd.concat([pd.read_csv(f, **kwargs) for f in flist], axis=1)
fmask = '*.csv'
# column numbers start from 0, so the 9th column has index 8
df = get_merged_csv(glob.glob(fmask), usecols=[8])
df.to_csv(filename, mode="a", header=False, index=False)
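If you want to pick the column by name instead of by position, here is a minimal sketch of the same idea (assuming every file has a "Close" header and reusing the merged.csv path from the question):
import glob
import pandas as pd

def get_merged_csv(flist, **kwargs):
    # read each file and place the resulting columns side by side
    return pd.concat([pd.read_csv(f, **kwargs) for f in flist], axis=1)

# select the "Close" column by name rather than by position
df = get_merged_csv(glob.glob("*.csv"), usecols=["Close"])
df.to_csv("./temp/merged.csv", header=False, index=False)
concat pads the shorter columns with NaN, which matches the ragged output shown in the question.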
I'm not sure how to do this using Python, but in R, it is very easy.
Merge all columns in File1 and Column12 in File2.
import pandas as pd
file1 = pd.read_table('C:\\Users\\your_path_here\\Book1.csv', delimiter=',', header=None)
file2 = pd.read_table('C:\\Users\\your_path_here\\Book2.csv', delimiter=',', header=None)
file2_short = file2.iloc[:, 12:13]
#print (file2_short)
frames=[file1, file2_short]
new = pd.concat(frames, axis=1)
new.to_csv('C:\\Users\\your_path_here\\newfile.csv')
I am reading csv files from multiple zip files into a dataframe and then using .to_csv to save the df with the code below.
import glob
import zipfile
import pandas as pd

dfs = []
for zip_file in glob.glob(r"C:\Users\harsh\Desktop\Temp\*.zip"):
    zf = zipfile.ZipFile(zip_file)
    dfs += [pd.read_csv(zf.open(f), header=None, sep=";", encoding='latin1') for f in zf.namelist()]

df = pd.concat(dfs, ignore_index=True)
df.to_csv(r"C:\Users\harsh\Desktop\Temp\data.csv")
However, I am getting a single column in which the values are still separated by ",".
example:
0
0 Div,Date,HomeTeam,AwayTeam,FTHG,FTAG,FTR,HTHG,...
1 SC3,05/08/00,Albion Rvs,East Fife,0,1,A,0,0,D,...
...
215179 ,,,,,,,,,
There are NaN values as well in the df
Is there any way to save the df with proper structure and data in respective columns?
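The sample rows above are comma-separated, so the single column is most likely caused by reading with sep=";". A minimal sketch of the same loop with a comma separator (assuming all csv files inside the zips share the header shown in the sample):
import glob
import zipfile
import pandas as pd

dfs = []
for zip_file in glob.glob(r"C:\Users\harsh\Desktop\Temp\*.zip"):
    zf = zipfile.ZipFile(zip_file)
    # the data itself is comma-separated, so split on "," and let the first row become the header
    dfs += [pd.read_csv(zf.open(f), sep=",", encoding='latin1') for f in zf.namelist()]

df = pd.concat(dfs, ignore_index=True)
# index=False avoids writing an extra unnamed index column
df.to_csv(r"C:\Users\harsh\Desktop\Temp\data.csv", index=False)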
So I have a csv file where I need to filter the rows based on the values that I have in a txt file. Is there an easy way to do this in pandas? The csv will have about 2000 rows and the txt file has about 400 data points. I need to generate a csv with the rows that match the data in the txt file.
The CSV file looks like this:
Chromosome Gene Start End
1 PERM1 5 6
2 AGRN 7 10
3 MIB2 9 12
The text file looks like this:
PERM1
NADK
GNB1
Thank you
First, read the text file into a list or tuple (stripping the trailing newlines):
lines = tuple(line.strip() for line in open(filename, 'r'))
Then keep only the rows whose values exist in the text file:
df = pd.read_csv('csvfile')
result = df[df['Gene'].isin(lines)]
Easy enough using pandas read and filter functionality. I'm assuming you have an input csv file called input_csv_file.csv and a filter file called filter.csv. The filter file has a column "filter_locations" and the input file has a column called "location":
input_df = pd.read_csv('input_csv_file.csv')
filter_df = pd.read_csv('filter.csv')
filtered_df = input_df[input_df['location'].isin(filter_df['filter_locations'])]
This can be achieved by loading both files into dataframes and using a mask. The code below assumes your text file does not have a header and your csv file is space-delimited:
import pandas as pd
df1 = pd.read_csv('csvfile.csv', delimiter=' ')
df2 = pd.read_csv('textfile.txt', header=None)
df2.columns = ['Gene']
m = df1.Gene.isin(df2.Gene)
df3 = df1[m]
print(df3)
I have 5 Excel files, each file contains 8 sheets, each sheet contains around 30 rows. So this means 1 file has in total 30 x 8 = 240 rows. Is there a speedy trick I can use to combine all these 5 files (including sheets) into 1 Excel file with in total 240 x 5 = 1200 rows?
This is my current code:
import os
import pandas as pd

files = os.listdir(r'c:\data\KM\Desktop\my_folder')
os.chdir(r'c:\data\KM\Desktop\my_folder')

df = pd.DataFrame()
for file in files:
    if file.endswith('.xlsx'):
        df = df.append(pd.read_excel(file))

df.head()
df.to_excel('all_files.xlsx')
Now with this code I have 2 problems:
From all the files I just get the 1st sheet, so it merges 8 sheets in total instead of 40 (8 x 5) :(
For every file it also copies the column headers; this should happen just for the 1st file. All the files and sheets have the same column names.
Appreciate your help all.
Use read_excel with sheet_name=None to read all sheets, join each file's sheets together with concat, and then use concat again to build one big DataFrame:
import glob
files = glob.glob(r'c:/data/KM/Desktop/my_folder/*.xlsx')
dfs = (pd.concat(pd.read_excel(fp, sheet_name=None)) for fp in files)
dfbig = pd.concat(dfs, ignore_index=True)
Edit: To drop the last sheet, convert the OrderedDict returned by read_excel to a list of DataFrames and remove the last one by indexing:
files = glob.glob(r'D:/Dropbox/work-joy/so/files/*.xlsx')
dfs = (pd.concat([v for k, v in pd.read_excel(fp, sheet_name=None).items()][:-1])
for fp in files)
df = pd.concat(dfs, ignore_index=True)
I have 200 .txt files and need to extract one row of data from each file and combine them into a new dataframe.
For example, from a set of files (abc1.txt, abc2.txt, etc.) I need to extract the 5th row of data from each file and create a dataframe. When reading the files, columns need to be separated by the '\t' sign.
like this
data = pd.read_csv('abc1.txt', sep="\t", header=None)
I can not figure out how to do all this with a loop. Can you help?
Here is my answer:
import pandas as pd
from pathlib import Path

path = Path('path/to/dir')
files = path.glob('*.txt')

to_concat = []
for f in files:
    df = pd.read_csv(f, sep="\t", header=None, nrows=5).loc[4:4]
    to_concat.append(df)

result = pd.concat(to_concat)
I have used nrows to read only the first 5 rows, and then .loc[4:4] to get a DataFrame rather than a Series (which is what you get with .loc[4]).
Here you go:
import os
import pandas as pd

directory = 'C:\\Users\\PC\\Desktop\\datafiles\\'

aggregate = pd.DataFrame()
for filename in os.listdir(directory):
    if filename.endswith(".txt"):
        data = pd.read_csv(directory + filename, sep="\t", header=None)
        row5 = pd.DataFrame(data.iloc[4]).transpose()
        aggregate = aggregate.append(row5)
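Note that DataFrame.append was removed in pandas 2.0, so on a recent pandas the last line will fail. Here is a sketch of the same loop that collects the rows in a list and concatenates them once (same hypothetical directory as above):
import os
import pandas as pd

directory = 'C:\\Users\\PC\\Desktop\\datafiles\\'

rows = []
for filename in os.listdir(directory):
    if filename.endswith(".txt"):
        data = pd.read_csv(directory + filename, sep="\t", header=None)
        # keep the 5th row (index 4) as a one-row DataFrame
        rows.append(data.iloc[[4]])

aggregate = pd.concat(rows, ignore_index=True)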
I have data in 10 individual csv files. Each csv file just has one row of data entries (500000 data points, no headers etc.). Three questions:
How can I transform the data to be one column with 500000 rows?
Is it better to import them into one numpy array (500000 x 10) to analyze them? If so, how can one do this?
Or is it better to import them into one DataFrame (500000 x 10) to analyze it?
Assume you have a list of file names called files. Then:
df = pd.concat([pd.read_csv(f, header=None) for f in files], ignore_index=True)
df is a 10 x 500000 dataframe. Make it a 500000 x 10 with df.T
The answers to 2 and 3 depend on your task.
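For questions 2 and 3, here is a minimal sketch of both options (the glob pattern is an assumption; any list of the 10 file paths works):
import glob
import pandas as pd

files = glob.glob('*.csv')  # hypothetical: the 10 one-row csv files
df = pd.concat([pd.read_csv(f, header=None) for f in files], ignore_index=True).T
# df is now 500000 x 10, one column per original file

arr = df.to_numpy()  # the same data as a numpy array of shape (500000, 10)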
First, read all 10 csv:
import os, csv, numpy
import pandas as pd

my_csvs = os.listdir('path to folder with 10 csvs')  # selects all files in the folder
list_of_columns = []
os.chdir('path to folder with 10 csvs')
for file in my_csvs:
    column = []
    with open(file, 'r') as f:
        reader = csv.reader(f)
        for row in reader:
            column.append(row)
    list_of_columns.append(column)
This is how you get a list of columns (one list per file). Next, transform them into a pandas df or a numpy array, or whatever you feel comfortable working with.
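For example, a sketch that flattens each file's rows into a single flat column and builds a DataFrame or numpy array from the result (the column names are made up):
import numpy
import pandas as pd

# each entry of list_of_columns holds the rows read from one file;
# flatten it so every file becomes one flat list of values
flat_columns = [[value for row in column for value in row] for column in list_of_columns]

# as a DataFrame: one column per file
df = pd.DataFrame({'file_%d' % i: col for i, col in enumerate(flat_columns)})

# or as a numpy array of shape (rows, number_of_files)
arr = numpy.array(flat_columns, dtype=float).T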