Summing numbers in two different .txt files in Python [closed]

I am currently trying to sum two .txt files, each containing over 35 million values, and put the result in a third file.
File 1 :
2694.28
2694.62
2694.84
2695.17
File 2 :
1.483429484776452
2.2403221757269196
1.101004844694236
1.6119626937837102
File 3 :
2695.76343
2696.86032
2695.941
2696.78196
Any idea how to do that with Python?

You can use NumPy for speed. It will be much faster than pure Python, since NumPy implements many of its operations in C.
import os
import numpy

# Directory containing this script
path = os.path.dirname(os.path.realpath(__file__))
file_name_1 = os.path.join(path, 'values_1.txt')
file_name_2 = os.path.join(path, 'values_2.txt')

# Load both files into float arrays and add them element-wise
a = numpy.loadtxt(file_name_1, dtype=float)
b = numpy.loadtxt(file_name_2, dtype=float)
c = a + b

# Write the sums with 10 digits after the decimal point
precision = 10
numpy.savetxt(os.path.join(path, 'sum.txt'), c, fmt=f'%-.{precision}f')
This assumes your .txt files are located in the same directory as your Python script. Note that this loads both arrays fully into memory; 35 million float64 values is roughly 280 MB per array, which is usually fine on a modern machine.

You can use pandas.read_csv to read, sum, and then write chunks of your files.
Presumably all 35 million records will not fit comfortably in memory at once, so you need to read the files in chunks. This way you load only one chunk at a time into memory (two, actually: one from file1 and one from file2), do the sum, and write to file3 one chunk at a time in append mode.
In this dummy example I set chunksize=2, because I tested on your sample inputs, which are only 4 lines long. The best chunk size depends on the machine you are working on; do some tests and see what works for your problem (50k, 100k, 500k, 1M, etc.).
import pandas as pd

chunksize = 2
with pd.read_csv("file1.txt", chunksize=chunksize, header=None) as reader1, \
     pd.read_csv("file2.txt", chunksize=chunksize, header=None) as reader2:
    for chunk1, chunk2 in zip(reader1, reader2):
        (chunk1 + chunk2).to_csv("file3.txt", index=False, header=False, mode='a')
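If you would rather avoid extra dependencies, a plain-Python version can stream both files line by line, so only one pair of values is in memory at a time. A minimal sketch, assuming the file names from the question and one value per line:

# Stream both inputs in parallel and append each sum to the output.
with open("file1.txt") as f1, open("file2.txt") as f2, open("file3.txt", "w") as out:
    for line1, line2 in zip(f1, f2):
        out.write(f"{float(line1) + float(line2)}\n")

This will be slower than the NumPy and pandas versions, but its memory use stays constant no matter how large the files are.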

Related

Read data group-wise with a certain pattern [closed]

I have data inside a directory as follows:
IU.WRT.00.MTR.1999.081.081015.txt
IU.WRT.00.MTS.2007.229.022240.txt
IU.WRT.00.MTR.2007.229.022240.txt
IU.WRT.00.MTT.1999.081.081015.txt
IU.WRT.00.MTS.1999.081.081015.txt
IU.WRT.00.MTT.2007.229.022240.txt
and I want to read the data group-wise.
At first I want to read the 3 files with a similar pattern (differing by R, S, T):
IU.WRT.00.MTR.1999.081.081015.txt
IU.WRT.00.MTS.1999.081.081015.txt
IU.WRT.00.MTT.1999.081.081015.txt
and apply some operations on them,
and then I want to read
IU.WRT.00.MTT.2007.229.022240.txt
IU.WRT.00.MTS.2007.229.022240.txt
IU.WRT.00.MTR.2007.229.022240.txt
and apply a similar operation to them.
In the same way I want to continue the process for millions of data sets.
I tried the example script:
import glob
from collections import defaultdict

def groupfiles(pattern):
    files = glob.glob(pattern)
    filedict = defaultdict(list)
    for file in files:
        parts = file.split(".")
        filedict[".".join([parts[5], parts[6], parts[7]])].append(file)
    for filegroup in filedict.values():
        yield filegroup

for relatedfiles in groupfiles('*.txt'):
    print(relatedfiles)
    for filename in relatedfiles:
        print(filename)
However, it reads the files one by one, but I need to read 3 files at a time. I hope experts can help me. Thanks in advance.
Use proper glob patterns to get the files:
files_1999 = glob.glob('IU.WRT.00.MT[RST].1999.081.081015.txt')
To generalize,
import glob

years = set(file.split('.')[4] for file in glob.glob('*.txt'))
file_group = {}
for year in years:
    pattern = f'IU.WRT.00.MT[RST].{year}*.txt'
    file_group[year] = glob.glob(pattern)
Output
{
"2007":[
"IU.WRT.00.MTS.2007.229.022240.txt",
"IU.WRT.00.MTR.2007.229.022240.txt",
"IU.WRT.00.MTT.2007.229.022240.txt"
],
"1999":[
"IU.WRT.00.MTS.1999.081.081015.txt",
"IU.WRT.00.MTR.1999.081.081015.txt",
"IU.WRT.00.MTT.1999.081.081015.txt"
]
}
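Once the groups are built, you can read the three related files of each group together. A minimal sketch along the lines of the answer above, where the loop body stands in for whatever operation you want to apply to each R/S/T trio:

import glob

years = set(f.split('.')[4] for f in glob.glob('*.txt'))
for year in sorted(years):
    group = sorted(glob.glob(f'IU.WRT.00.MT[RST].{year}*.txt'))
    # read all three components (R, S, T) of this group together
    contents = {}
    for filename in group:
        component = filename.split('.')[3]  # 'MTR', 'MTS' or 'MTT'
        with open(filename) as fh:
            contents[component] = fh.read()
    # contents now maps each component name to its data;
    # apply your operation to the trio here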

Import multiple Excel files in Python, manipulate, and then export multiple files in the same directory [closed]

I have data for 50 people in Excel files placed in the same folder. For each person the data is present in five different files, as shown below:
Example:
Person1_a.xls, Person1_b.xls, Person1_c.xls, Person1_d.xls, Person1_e.xls.
Each Excel file has two columns and multiple sheets. I need to create a file Person1.xls which will have the second column of all these files combined. The same process should be applicable for all 50 people.
Any suggestions would be appreciated.
Thank you!
I have created a trial folder that I believe is similar to yours. I added data only for Person1 and Person3.
In the attached picture, the files called Person1 and Person3 are the exported files that include only the 2nd column for each person, so each person now has their own file.
I added a small description of what each line does. Please let me know if something is not clear.
import os
import pandas as pd
import glob

path = r'C:\..\trial'  # use your path where the files are
all_files = glob.glob(path + "/*.xlsx")  # gets all files with an .xlsx extension in the folder

li = []
for f in all_files:
    df = pd.read_excel(f,
                       sheet_name=0,  # import the 1st sheet
                       usecols=[1])   # only import column 2 (0-based indexing)
    df['person'] = os.path.basename(f).split('_')[0]  # get the person's name from the filename
    li.append(df)  # add it to the list of dataframes

all_person = pd.concat(li, axis=0, ignore_index=True)  # concat all imported dataframes
Then you can export to the same path a different Excel file for each person:
for i, j in all_person.groupby('person'):
    j.to_excel(f'{path}\\{i}.xlsx', index=False)
I am aware that this is probably not the most efficient way, but it will probably get you what you need.
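The question also mentions that each file has multiple sheets, while the code above reads only the first one. If the other sheets matter, pandas can load them all at once with sheet_name=None, which returns a dict of DataFrames. A minimal sketch, assuming every sheet has the same two-column layout (the file name is just an example):

import pandas as pd

# sheet_name=None loads every sheet of the workbook into a dict keyed by sheet name
sheets = pd.read_excel('Person1_a.xlsx', sheet_name=None, usecols=[1])
# stack the 2nd column of all sheets into one DataFrame
combined = pd.concat(sheets.values(), ignore_index=True)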

Extract one specific column with blank lines from multiple csv files and merge into one [closed]

I have my csv files in the same folder. I want to get only the data in the 5th column from all my csv files and write the data into a single file. But there are blank lines in my csv files. https://drive.google.com/file/d/1SospIppACOrLeKPU_9OknnDLnDpatIqE/view?usp=sharing
How can I keep the blanks with pandas.read_csv command?
Many thanks!
Fake data (this answer uses R):
sapply(1:3, function(i) write.csv(mtcars, paste0(i,".csv"), row.names=FALSE))
results in three csv files, named 1.csv through 3.csv, each with:
"mpg","cyl","disp","hp","drat","wt","qsec","vs","am","gear","carb"
21,6,160,110,3.9,2.62,16.46,0,1,4,4
21,6,160,110,3.9,2.875,17.02,0,1,4,4
22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
21.4,6,258,110,3.08,3.215,19.44,1,0,3,1
...
The R code:
write.csv(sapply(list.files(pattern="*.csv"), function(a) read.csv(a)[,5]),
          "agg.csv", row.names=FALSE)
results in a single CSV file, agg.csv, that contains
"1.csv","2.csv","3.csv"
3.9,3.9,3.9
3.9,3.9,3.9
3.85,3.85,3.85
3.08,3.08,3.08
...
You can use the usecols argument of pandas.read_csv.
import pandas as pd
from glob import glob
So what we are doing here is looping over all files in the current directory that end with .csv, and for each of those files reading in only the column of interest, i.e. the 5th column. We write usecols=[4] because pandas uses 0-based indexing: out of 0, 1, 2, 3, 4, the fifth number is 4. Additionally, your sample data contains 9 blank lines leading up to the actual data, so we set skiprows=9 to skip them.
We concatenate all of those into one DataFrame using pd.concat.
combined_df = pd.concat(
    [
        pd.read_csv(csv_file, usecols=[4], skiprows=9)
        for csv_file in glob('*.csv')
    ]
)
To get rid of blank lines from your DataFrame, you can simply use:
combined_df = combined_df.dropna()
We can then simply write this combined_df to a file:
combined_df.to_csv('combined_column_5.csv', index=False)

Add a column to a csv file in Python based on other columns [closed]

I am using Python 3.6.2 and have the following csv file:
STATE,RATE,DEATHS
IA,4.2,166
NH,4.2,52
MA,4.3,309
CA,4.4,2169
CO,4.6,309
ID,4.6,106
NY,4.6,1087
VT,4.6,27
NJ,4.7,487
I am trying to add a new column to the file, where I multiply the rate column times the deaths column. The following table is what I'd like my results to look like.
STATE,RATE,DEATHS,NEW
IA,4.2,166,697.2
NH,4.2,52,218.4
MA,4.3,309,1328.7
CA,4.4,2169,9543.6
CO,4.6,309,1421.4
ID,4.6,106,487.6
NY,4.6,1087,5000.2
VT,4.6,27,124.2
NJ,4.7,487,2288.9
I've tried looking for an answer to this question but couldn't find anything similar to this. Thanks in advance.
Use pandas:
import pandas as pd
df = pd.read_csv('path/to/yourfile.csv')
df['NEW'] = df.RATE * df.DEATHS
df.to_csv('path/to/yournewfile.csv', index=False)
Using the pandas library, this is fairly simple:
import pandas as pd
df = pd.read_csv('filename.csv')
df['NEW'] = df['RATE'] * df['DEATHS']
# You can save over the old file, though I would suggest saving a new one
# in case you make a mistake
df.to_csv('new_filename.csv', index=False)
There are several cool things that the pandas library takes care of for us. First, we easily parse the csv with the pd.read_csv() call. Next, pandas DataFrame objects (which is what the variable df is) let us use keys to access and create columns, much like a Python dictionary. When we perform mathematical operations using columns of the DataFrame, pandas applies the operation element-wise: in our example, the value at index 0 of the 'RATE' column is multiplied by the value at index 0 of the 'DEATHS' column, and so on down the columns.
In short, if you are going to access and manipulate spreadsheet-like files in python, pandas is a powerful and easy-to-use library.
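For instance, a minimal sketch with two rows of the question's data shows the element-wise behaviour directly:

import pandas as pd

df = pd.DataFrame({'RATE': [4.2, 4.2], 'DEATHS': [166, 52]})
df['NEW'] = df['RATE'] * df['DEATHS']  # row 0 pairs with row 0, row 1 with row 1
print(df)
#    RATE  DEATHS    NEW
# 0   4.2     166  697.2
# 1   4.2      52  218.4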
with open('test.csv', 'r') as file:
    lines = file.readlines()

# print the new header
print(lines[0].strip() + ',NEWCOLUMN')

# loop through the remaining lines, starting from 1
for line in lines[1:]:
    line_items = line.strip().split(',')
    # your operation
    new_column = float(line_items[1]) * float(line_items[2])
    line_items.append(new_column)
    print(",".join(map(str, line_items)))
You can read the csv with the built-in csv package and then manipulate the columns as you need. Of course, you can use the pandas library, but that is like using a sledgehammer to crack a nut. Replace StringIO (used here just to make testing simple) in the example below with the file reading, and the job is done.
from io import StringIO
import csv

f_in = StringIO("""STATE,RATE,DEATHS
IA,4.2,166
NH,4.2,52
MA,4.3,309
CA,4.4,2169
CO,4.6,309
ID,4.6,106
NY,4.6,1087
VT,4.6,27
NJ,4.7,487""")

reader = csv.reader(f_in)
with open('new.csv', 'w', newline='') as f:  # newline='' avoids blank lines on Windows
    writer = csv.writer(f)
    headings = next(reader)
    headings.append('NEW')
    writer.writerow(headings)
    for row in reader:
        row.append(str(round(float(row[1]) * float(row[2]), 1)))
        writer.writerow(row)

Importing a csv file into Python and creating a table [closed]

I want to import a csv file into Python and then create a table to display the contents of the imported csv file.
I further need to do manipulations on the data present in the table.
More table-related functions should then be performed, like:
1) highlighting a specified column using Python
2) modifying a particular column, like sorting the data by date or quantity, using Python
This is an example of how to import a csv or txt file with Python. I use NumPy to do it:
#!/usr/bin/env python
import numpy as np

# Put the delimiter from your csv file: comma, blank, ...
data = np.loadtxt('filename', delimiter=',')
print(data)  # print the numpy array
If you want to sort your array, you can use the numpy function np.sort() (see the NumPy documentation).
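Note that on a 2D array np.sort sorts each column independently, which scrambles the rows. To reorder whole rows by one column (e.g. your quantity column), argsort is usually what you want; a minimal sketch with made-up numbers:

import numpy as np

data = np.array([[1.0, 30.0],
                 [2.0, 10.0],
                 [3.0, 20.0]])

# argsort on the 2nd column gives the row order that sorts it ascending
sorted_rows = data[data[:, 1].argsort()]
print(sorted_rows)
# [[ 2. 10.]
#  [ 3. 20.]
#  [ 1. 30.]]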
Now, try to make something yourself; it's important to have a script ready before posting your question ;)
