I have a folder that contains 1200 feather files, and I want to import these feather files into Python so that I can analyze the data in them. How am I supposed to do that?
I've tried methods like feather.read_dataframe() and passing the file path directly, and none of them succeeded.
You can use the following code to read the feather files and concatenate them into a single dataframe. From there you can analyze the data.
import pandas as pd
from pathlib import Path

# Set the folder path where the feather files are located
folder_path = "/path/to/folder"

# Create an empty list to store the dataframes
dfs = []

# Loop through all the feather files in the folder and read them into pandas dataframes
for file_path in Path(folder_path).glob("*.feather"):
    df = pd.read_feather(file_path)
    dfs.append(df)

# Concatenate the dataframes into a single dataframe
result = pd.concat(dfs, ignore_index=True)
Path(folder_path).glob("*.feather") returns an iterator over all the files in folder_path whose names end with .feather.
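If you'd rather go through the Feather API directly instead of pandas, here is a minimal sketch using pyarrow (this assumes the pyarrow package is installed; the folder path is the same placeholder as above):

from pathlib import Path
import pyarrow.feather as feather
import pandas as pd

folder_path = "/path/to/folder"  # placeholder path

# Read each file as an Arrow table, convert to pandas, and concatenate
frames = [feather.read_table(str(p)).to_pandas() for p in Path(folder_path).glob("*.feather")]
result = pd.concat(frames, ignore_index=True)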
I am currently trying to sum two .txt files, each containing over 35 million values, and put the result in a third file.
File 1 :
2694.28
2694.62
2694.84
2695.17
File 2 :
1.483429484776452
2.2403221757269196
1.101004844694236
1.6119626937837102
File 3 :
2695.76343
2696.86032
2695.941
2696.78196
Any idea how to do that with Python?
You can use NumPy for speed. It will be much faster than pure Python, since NumPy uses C/C++ for a lot of its operations.
import numpy
import os

# Directory containing this script; the .txt files are assumed to sit next to it
path = os.path.dirname(os.path.realpath(__file__))
file_name_1 = path + '/values_1.txt'
file_name_2 = path + '/values_2.txt'

# Load both files into float arrays and sum them element-wise
a = numpy.loadtxt(file_name_1, dtype=float)
b = numpy.loadtxt(file_name_2, dtype=float)
c = a + b

# Write the result with the desired number of decimal places
precision = 10
numpy.savetxt(path + '/sum.txt', c, fmt=f'%-.{precision}f')
This assumes your .txt files are located in the same directory as your Python script.
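As a rough sanity check on memory: each array of 35 million float64 values takes roughly 280 MB, so loading both files at once is usually feasible on a typical machine. A quick back-of-the-envelope sketch:

# Rough memory estimate: 35 million float64 values at 8 bytes each
n_values = 35_000_000
print(f"~{n_values * 8 / 1e6:.0f} MB per array")  # roughly 280 MB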
You can use pandas.read_csv to read, sum, and then write your files in chunks.
Presumably all 35 million records do not fit in memory at once, so you need to read the files chunk by chunk. That way you load only one chunk at a time into memory (two, actually: one for file1 and one for file2), do the sum, and write one chunk at a time to file3 in append mode.
In this dummy example I set chunksize=2, because the sample inputs are only 4 values long. The right value depends on the machine you are working on; do some tests and see what size works best for your problem (50k, 100k, 500k, 1M, etc.).
import pandas as pd

chunksize = 2
with pd.read_csv("file1.txt", chunksize=chunksize, header=None) as reader1, \
     pd.read_csv("file2.txt", chunksize=chunksize, header=None) as reader2:
    for chunk1, chunk2 in zip(reader1, reader2):
        (chunk1 + chunk2).to_csv("file3.txt", index=False, header=False, mode='a')
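One small caveat with append mode: if file3.txt is left over from an earlier run, the new chunks get appended to the old contents. A minimal precaution (a sketch, reusing the output file name above) is to truncate it before the loop:

# Start with an empty output file so append mode doesn't pile onto old results
open("file3.txt", "w").close()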
I want to convert the data at the following link to Excel using Python, so that the country, its capital, and its country codes are stored in an Excel file.
Can you please guide me?
https://restcountries.eu/rest/v2/all
The pandas library will come to the rescue here, though extracting your nested JSON is more a matter of general Python skills. You can do the following to simply extract the desired columns:
import pandas as pd

url = 'https://restcountries.eu/rest/v2/all'

# Load the JSON into a dataframe
df = pd.read_json(url)

# Create a DF with the country, capital and code fields. You can use df.head()
# to see how your data looks in table format and what the column names are.
df_new = df[['name', 'capital', 'alpha2Code', 'alpha3Code']].copy()

# Use pandas ExcelWriter to write the desired DataFrame to an xlsx file.
with pd.ExcelWriter('country_names.xlsx') as writer:
    df_new.to_excel(writer, sheet_name="Country List")
Full info on to_excel and ExcelWriter can be found at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html
You will need to play around a bit to rename the columns and clean up the data (especially the nested objects), but those steps should be just a search away.
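For the column renaming mentioned above, a minimal sketch (the new header names are just illustrative, not part of the API):

# Rename the API's field names to friendlier Excel headers (names are illustrative)
df_new = df_new.rename(columns={
    'name': 'Country',
    'capital': 'Capital',
    'alpha2Code': 'Code (alpha-2)',
    'alpha3Code': 'Code (alpha-3)',
})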
Your best bet would be to use pandas to read the JSON from the URL you mentioned and save it to an Excel file. Here's the code for it:
import pandas as pd
# Loading the JSON from the URL to a pandas dataframe
df = pd.read_json('https://restcountries.eu/rest/v2/all')
# Selecting the columns for the country name, capital, and the country code (as mentioned in the question)
df = df[["name", "capital", "alpha2Code"]]
# Saving the data frame into an excel file named 'restcountries.xlsx', but feel free to change the name
df.to_excel('restcountries.xlsx')
However, there will be an issue with reading nested fields (if you want them in the future). For example, the fields named borders and currencies in your dataset are lists, so you might need some post-processing after you load the data.
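A hedged sketch of one way to post-process such a list-valued column, here flattening currencies into a comma-separated string of codes (the 'code' key is assumed from that API's response format):

# Flatten the list-of-dicts 'currencies' column into a single string per row
df_flat = pd.read_json('https://restcountries.eu/rest/v2/all')
df_flat['currencies'] = df_flat['currencies'].apply(
    lambda items: ', '.join(c.get('code', '') for c in items) if isinstance(items, list) else ''
)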
Cheers!
I have data for 50 people in separate Excel files placed in the same folder. For each person, the data is present in five different files, as shown below:
Example:
Person1_a.xls, Person1_b.xls, Person1_c.xls, Person1_d.xls, Person1_e.xls.
Each Excel file has multiple sheets, each with two columns. I need to create a file Person1.xls that combines the second column from all of that person's files. The same process should be applied for all 50 people.
Any suggestions would be appreciated.
Thank you!
I have created a trial folder that I believe is similar to yours, with data only for Person1 and Person3.
The exported files called Person1 and Person3 include only the 2nd column for each person, so each person ends up with their own file.
I added a small description of what each line does. Please let me know if something is not clear.
import pandas as pd
import glob

path = r'C:\..\trial'  # use your path where the files are

all_files = glob.glob(path + "/*.xlsx")  # all files with an .xlsx extension in the folder

li = []

for i in range(1, 51):  # numbers from 1 to 50 (for the 50 different people)
    for f in all_files:
        if f'Person{i}_' in f:  # match this person's files exactly (so Person1 doesn't also match Person15)
            df = pd.read_excel(f,
                               sheet_name=0,  # import the 1st sheet
                               usecols=[1])   # only import column 2
            df['person'] = f.rsplit('\\', 1)[1].split('_')[0]  # name of the person, taken from the file name
            li.append(df)  # add it to the list of dataframes

all_person = pd.concat(li, axis=0, ignore_index=True)  # concat all imported dataframes
Then you can export to the same path a different Excel file for each person:

for i, j in all_person.groupby('person'):
    j.to_excel(f'{path}\\{i}.xlsx', index=False)
I am aware that this is probably not the most efficient way, but it should get you what you need.
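For what it's worth, a shorter sketch of the same idea that skips the numeric loop and groups files by their PersonN prefix instead (this assumes file names like Person1_a.xlsx, as in the question, and the same placeholder path as above):

import glob
import os
from collections import defaultdict
import pandas as pd

path = r'C:\..\trial'  # same folder as above

by_person = defaultdict(list)
for f in glob.glob(os.path.join(path, '*_*.xlsx')):  # only the per-person source files (Person1_a.xlsx, ...)
    person = os.path.basename(f).split('_')[0]  # e.g. 'Person1'
    by_person[person].append(pd.read_excel(f, sheet_name=0, usecols=[1]))

# One output file per person, containing all of their second columns stacked
for person, frames in by_person.items():
    pd.concat(frames, ignore_index=True).to_excel(os.path.join(path, f'{person}.xlsx'), index=False)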
This is a portion of a larger piece of code.
I have a directory with hundreds of .log files that I need to convert to .xlsx files, one at a time. I wrote this code:
import csv
import glob
import pandas as pd
df = pd.read_csv('input.log', delimiter=r"\s+", header=None, names=list(range(20)))
df.to_excel('input.xlsx', 'Sheet1')
This works for a single file. What do I need to add to have it loop through the directory and convert each file, regardless of how many there are?
import glob
import pandas as pd

files = glob.glob("*.log")  # every .log file in the current directory

for file in files:
    df = pd.read_csv(file, delimiter=r"\s+", header=None, names=list(range(20)))
    df.to_excel(file + '.xlsx', index=False)  # e.g. input.log -> input.log.xlsx
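If you'd rather the output be input.xlsx instead of input.log.xlsx, a small variant sketch using pathlib (purely a naming preference):

from pathlib import Path
import pandas as pd

for log_file in Path('.').glob('*.log'):
    df = pd.read_csv(log_file, delimiter=r"\s+", header=None, names=list(range(20)))
    df.to_excel(log_file.with_suffix('.xlsx'), index=False)  # input.log -> input.xlsx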
I have my csv files in the same folder. I want to get only the data in the 5th column from all my csv files and write the data into a single file. But there are blank lines in my csv files. https://drive.google.com/file/d/1SospIppACOrLeKPU_9OknnDLnDpatIqE/view?usp=sharing
How can I keep the blanks with the pandas.read_csv command?
Many thanks!
Fake data:
sapply(1:3, function(i) write.csv(mtcars, paste0(i,".csv"), row.names=FALSE))
results in three csv files, named 1.csv through 3.csv, each with:
"mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb"
21,6,160,110,3.9,2.62,16.46,0,1,4,4
21,6,160,110,3.9,2.875,17.02,0,1,4,4
22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
21.4,6,258,110,3.08,3.215,19.44,1,0,3,1
...
The code:
write.csv(sapply(list.files(pattern="*.csv"), function(a) read.csv(a)[,5]),
"agg.csv", row.names=FALSE)
results in a single CSV file, agg.csv, that contains
"1.csv","2.csv","3.csv"
3.9,3.9,3.9
3.9,3.9,3.9
3.85,3.85,3.85
3.08,3.08,3.08
...
You can use the usecols argument of pandas.read_csv.
import pandas as pd
from glob import glob
What we are doing here is looping over all files in the current directory that end with .csv, and for each of those files reading in only the column of interest, i.e. the 5th column. We write usecols=[4] because pandas uses 0-based indexing, so out of 0, 1, 2, 3, 4, the fifth number is 4. Additionally, you asked to skip blank lines, and your sample data contains 9 blank lines leading up to the actual data, so we set skiprows=9.
We concatenate all of those into one DataFrame using pd.concat.
combined_df = pd.concat(
    [
        pd.read_csv(csv_file, usecols=[4], skiprows=9)
        for csv_file in glob('*.csv')
    ]
)
To get rid of blank lines from your DataFrame, you can simply use:
combined_df = combined_df.dropna()
We can then simply write this combined_df to a file:
combined_df.to_csv('combined_column_5.csv')
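Note that to_csv writes the DataFrame's row index as an extra first column by default; if you only want the column values themselves, a small tweak (a sketch):

# Write only the collected 5th-column values, without the pandas row index
combined_df.to_csv('combined_column_5.csv', index=False)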