I need to create a DataFrame from multiple text files that all live in the same directory. Each text file contains data in the format shown in the attached image (whitespace-separated columns). How can I build a single DataFrame from these files?
If it is possible to remove the last line (Name: 45559, dtype: object), then you should be able to load each txt file as a CSV:
import pandas as pd
import os
txt_files_dir = '...'
files = os.listdir(txt_files_dir)
# Join the directory path onto each name; read whitespace-separated columns
dfs_list = [pd.read_csv(os.path.join(txt_files_dir, f), sep=r'\s+') for f in files]
# Stack all the frames into one, renumbering the index
data_frame_result = pd.concat(dfs_list, axis=0, ignore_index=True)
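If the stray last line (Name: 45559, dtype: object) is actually inside each file, you do not have to delete it by hand: pandas can skip it at read time. A minimal sketch, assuming exactly one footer line per file ('one_file.txt' is a placeholder name):
import pandas as pd
# skipfooter drops the last line of the file; it is only supported by the python engine
df = pd.read_csv('one_file.txt', sep=r'\s+', skipfooter=1, engine='python')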
I am trying to read a .csv file contained within a .zip file from a URL into a Pandas DataFrame; I don't want to download the .zip file to disk but rather read the data directly from the URL. I realize that pandas.read_csv() can only do this if the .csv file is the only file contained in the .zip. However, when I run this:
import pandas as pd
# specify zipped comma-separated values url
zip_csv_url = 'http://www12.statcan.gc.ca/census-recensement/2016/geo/ref/gaf/files-fichiers/2016_92-151_XBB_csv.zip'
df1 = pd.read_csv(zip_csv_url)
I get this:
ValueError: Multiple files found in compressed zip file ['2016_92-151_XBB.csv', '92-151-g2016001-eng.pdf', '92-151-g2016001-fra.pdf']
The contents of the .zip appear to be arranged as a list; I'm wondering how I can load the new DataFrame (df1) from the only .csv file in the .zip (the .zip files from the URLs I will be using will only ever have one .csv file within them). Thanks!
N.B.
The corresponding .zip file from a separate URL with shapefiles reads no problem with geopandas.read_file() when I run this code:
import geopandas as gpd
# specify zipped shapefile url
zip_shp_url = 'http://www12.statcan.gc.ca/census-recensement/2011/geo/bound-limit/files-fichiers/2016/ldb_000b16a_e.zip'
gdf1 = gpd.read_file(zip_shp_url)
This works despite the .zip also containing a .pdf file, as seen in the image below:
It would appear that the geopandas.read_file() has the ability to only read the requisite shapefiles for creating the GeoDataFrame while ignoring unnecessary data files. Since it is based on Pandas, shouldn't Pandas also have a functionality to only read a .csv within a .zip with multiple other file types? Any thoughts?
import os
import zipfile
import pandas as pd
from io import BytesIO
from urllib.request import urlopen
resp = urlopen(YOUR_ZIP_LINK)
files_zip = zipfile.ZipFile(BytesIO(resp.read()))
# files_zip.namelist() lists every member of the archive
directory_to_extract_to = YOUR_DESTINATION_FOLDER
file = YOUR_csv_FILE_NAME
# Extract just the csv member, then read it from disk
with files_zip as zip_ref:
    zip_ref.extract(file, directory_to_extract_to)
pd.read_csv(os.path.join(directory_to_extract_to, file))
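If you would rather not touch the disk at all, as the question asks, the .csv member can be read straight from memory. A minimal sketch that picks the first archive member ending in .csv, so the name never has to be hard-coded:
import zipfile
import pandas as pd
from io import BytesIO
from urllib.request import urlopen
resp = urlopen(zip_csv_url)  # zip_csv_url as defined in the question
archive = zipfile.ZipFile(BytesIO(resp.read()))
# Find the lone .csv among the archive's members
csv_name = next(n for n in archive.namelist() if n.lower().endswith('.csv'))
with archive.open(csv_name) as f:
    df1 = pd.read_csv(f)  # may need encoding= for non-UTF-8 data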
I have several large text files that I want to consolidate into one .csv file. However, each of the files is too large to import into Excel on its own, let alone all together.
I want to use pandas to analyze the data, but don't know how to get the files all in one place.
How would I go about reading the data directly into Python, or into Excel to produce a .csv file?
The data in question is the 2019-2020 Contributions by individuals file on the FEC's website.
You can convert each of the files to csv and then concatenate them to form one final csv file:
import os
import glob
import pandas as pd
csv_path = 'pathtonewcsvfolder'  # use your path
txt_path = 'path/to/textfiles'
# Convert each fixed-width text file to its own csv
for x, filename in enumerate(os.listdir(txt_path)):
    df = pd.read_fwf(os.path.join(txt_path, filename))  # note the full path
    df.to_csv(os.path.join(csv_path, 'log' + str(x) + '.csv'), index=False)
# Then concatenate all the intermediate csv files into one
all_csv_files = glob.iglob(os.path.join(csv_path, '*.csv'))
converted_df = pd.concat((pd.read_csv(f) for f in all_csv_files), ignore_index=True)
converted_df.to_csv('converted.csv', index=False)
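If even a single file is too large to load at once, pandas can also stream it in chunks instead of converting everything first. A minimal sketch, assuming the FEC individual-contributions file, which is pipe-delimited with no header row ('itcont.txt' is an assumed file name; adjust sep= to your data):
import pandas as pd
# Read 100,000 rows at a time and append each chunk to a single csv
chunks = pd.read_csv('itcont.txt', sep='|', header=None, chunksize=100_000)
for chunk in chunks:
    chunk.to_csv('combined.csv', mode='a', index=False, header=False)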
I have a directory of similar Excel files and want to extract the first sheet from each file and save it as a .csv file. Currently I have code which works to extract and save the sheet from an individual file:
import glob
import pandas as pd
f = glob.glob('filename.xlsx')  # assume the path
for excel in f:
    out = excel.split('.')[0] + '.csv'
    df = pd.read_excel(excel)  # if only the first sheet is needed
    df.to_csv(out)
You can get all your files into a list using glob:
import glob
import os
import pandas as pd
files_to_be_read = glob.glob("*.xlsx")  # assuming you run this in the folder where the excel files are saved
for i in files_to_be_read:
    df_in = pd.read_excel(i)  # pd.read_excel always uses the first sheet by default
    # to_csv is a DataFrame method (pd.to_csv does not exist); swap the extension for .csv
    df_in.to_csv(os.path.splitext(i)[0] + '.csv', index=False)
I have multiple Excel spreadsheets in a given folder and its subfolders. All share the same file name string, with a date-and-time suffix. How do I merge them all into one single file, using the worksheet names and titles as an index when appending the data frames? Typically the subfolders hold either ~100 small files of 200 KB each or ~10 files of 20 MB each.
This may help you to merge all the xlsx files in the current directory.
import glob
import os
import pandas as pd
# Collect each file's frame in a list; DataFrame.append was removed in pandas 2.x
frames = []
for file in glob.glob(os.path.join(os.getcwd(), "*.xlsx")):
    frames.append(pd.read_excel(file))
output = pd.concat(frames, ignore_index=True)
# header=False reproduces the original header=None intent (no header row)
output.to_csv(os.path.join(os.getcwd(), "outPut.csv"), index=False, na_rep="NA", header=False)
print("Completed")
Note: alongside pandas you need an Excel reader engine to read xlsx files: openpyxl on current pandas versions (older versions used xlrd, e.g. xlrd-1.1.0).
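The question also mentions subfolders, which the loop above will not reach; glob can descend into them with a '**' pattern and recursive=True. A minimal sketch ('merged.csv' is a placeholder output name):
import glob
import os
import pandas as pd
frames = []
# '**' plus recursive=True makes glob walk every subfolder
for path in glob.glob(os.path.join(os.getcwd(), "**", "*.xlsx"), recursive=True):
    df = pd.read_excel(path)
    df["source_file"] = os.path.basename(path)  # remember each row's origin
    frames.append(df)
pd.concat(frames, ignore_index=True).to_csv("merged.csv", index=False)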
I have tried this with static file name definitions; it would be better if it consolidated by column header from a dynamically picked file list, covering whatever ends with .xls* (xls / xlsx / xlsb / xlsm) as well as .csv and .txt; see the sketch after the error note below.
import pandas as pd
db = pd.read_excel("/data/Sites/Cluster1 0815.xlsx")
db1 = pd.read_excel("/data/Sites/Cluster2 0815.xlsx")
db2 = pd.read_excel("/data/Sites/Cluster3 0815.xlsx")  # was missing the pd. prefix
# concat replaces the removed DataFrame.append
sdb = pd.concat([db, db1, db2], ignore_index=True)
sdb.to_csv("/data/Sites/sites db.csv", index=False, na_rep="NA", header=False)
A dynamic file list merge produced the expected output, although the processing time has to be taken into account. When run over batches of files, however, the code raised an error (note that these files are asymmetric in the information they carry).
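For the dynamic pick itself, here is a minimal sketch, assuming the /data/Sites folder from the snippet above; each extension is routed to the matching reader, and pd.concat aligns the frames by column header (reading .xlsb additionally requires the pyxlsb package):
import glob
import pandas as pd
frames = []
for path in glob.glob("/data/Sites/*"):
    lower = path.lower()
    if lower.endswith((".xls", ".xlsx", ".xlsm", ".xlsb")):
        frames.append(pd.read_excel(path))
    elif lower.endswith((".csv", ".txt")):
        # sep=None with the python engine sniffs the delimiter
        frames.append(pd.read_csv(path, sep=None, engine="python"))
# concat aligns columns by header name, filling gaps with NaN
sdb = pd.concat(frames, ignore_index=True, sort=False)
sdb.to_csv("/data/Sites/sites db.csv", index=False, na_rep="NA")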
I have a text file that contains data like this. It is just a small example, but the real one is pretty similar.
I am wondering how to display such data in an "Excel Table" like this using Python?
The pandas library is wonderful for reading csv files (which is the file content in the image you linked). You can read in a csv or a txt file using the pandas library and output this to excel in 3 simple lines.
import pandas as pd
df = pd.read_csv('input.csv') # if your file is comma separated
or if your file is tab delimited '\t':
df = pd.read_csv('input.csv', sep='\t')
To save to excel file add the following:
df.to_excel('output.xlsx', 'Sheet1')
complete code:
import pandas as pd
df = pd.read_csv('input.csv') # can replace with df = pd.read_table('input.txt') for '\t'
df.to_excel('output.xlsx', 'Sheet1')
This will explicitly keep the index, so if your input file was:
A,B,C
1,2,3
4,5,6
7,8,9
In the output Excel file the data is shifted one column to the right, because the index has been written as column A. If you do not want this index column (because you have not assigned your df an index, so it just has the arbitrary one provided by pandas):
df.to_excel('output.xlsx', 'Sheet1', index=False)
This time the index column has been dropped from the Excel file.
You do not need python! Just rename your text file to CSV and voila, you get your desired output :)
If you want to rename it using Python, you can use the os.rename function:
import os
os.rename(src, dst)
where src is the source file path and dst is the destination path, e.g. os.rename('input.txt', 'input.csv').
XLWT
I use the XLWT library. It produces native Excel files, which is much better than simply importing text files as CSV files. It is a bit of work, but provides most key Excel features, including setting column widths, cell colors, cell formatting, etc.
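A minimal xlwt sketch (the sheet, cell, and file names here are made up for illustration; note that xlwt writes the legacy .xls format, not .xlsx):
import xlwt
wb = xlwt.Workbook()
ws = wb.add_sheet('Sheet1')
# easyxf builds a cell style from xlwt's small formatting language
header_style = xlwt.easyxf('font: bold on; pattern: pattern solid, fore_colour yellow')
ws.write(0, 0, 'Name', header_style)
ws.write(0, 1, 'Value', header_style)
ws.write(1, 0, 'alpha')
ws.write(1, 1, 42)
ws.col(0).width = 20 * 256  # column width is in 1/256ths of a character
wb.save('report.xls')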
For comparison, saving a DataFrame to Excel with pandas is simply:
df.to_excel("testfile.xlsx")