I need some help writing the values from a column to a text file in ascending order.
The code I currently have creates a directory called values and saves the values extracted from the column to a .txt file, but they are not in ascending order as I would like.
values_dir = os.path.join(cwd, 'values')
if not os.path.exists(values_dir):
    os.mkdir(values_dir)

with open(os.path.join(values_dir, 'values.txt'), "w") as txt_file:
    for name, group in split_location:
        txt_file.write(str(name) + '\n')
The code saves my values as
data23
data17
data88
I would like it to save as
data17
data23
data88
If someone could point me in the right direction it would be much appreciated, thank you.
Edit
I split two large dataframes by the unique values in the fields Data and Data_Unit:
datafile = pd.read_csv('location.csv')
datafile_large = pd.read_csv('large.csv')
split_location = datafile.groupby('Data')
split_large = datafile_large.groupby('Data_Unit')
I then loop through the groups and save the split dataframes to sub-directories based on their unique values, whilst maintaining the parent file name.
for name, group in split_location:
    sub_dir = os.path.join(cwd, name)
    if not os.path.exists(sub_dir):
        os.mkdir(sub_dir)
    group = group.drop(['Data'], axis=1)
    group.to_csv(sub_dir + "/location.csv", index=0)
for name, group in split_large:
    sub_dir = os.path.join(cwd, name)
    if not os.path.exists(sub_dir):
        os.mkdir(sub_dir)
    group = group.drop(['Data_Unit'], axis=1)
    group.to_csv(sub_dir + "/large.csv", index=0)
Lastly, I create the values.txt file as mentioned at the beginning, but I would like the values in the .txt file to be saved in ascending order.
values_dir = os.path.join(cwd, 'values')
if not os.path.exists(values_dir):
    os.mkdir(values_dir)

with open(os.path.join(values_dir, 'values.txt'), "w") as txt_file:
    for name, group in split_location:
        txt_file.write(str(name) + '\n')
Try this:
names, groups = map(list, zip(*split_location))
names.sort()
for name in names:
    txt_file.write(str(name) + '\n')
Instead of:
for name, group in split_location:
    txt_file.write(str(name) + '\n')
You can use Python's built-in sorted function or the sort method of a list. Another answer shows the sort method, so I'm using sorted here.
Also, use pathlib on Python 3.
from pathlib import Path
values_dir = Path.home() / 'values'
values_dir.mkdir(exist_ok=True)
# step one: get a list of names
# from your example, split_location looks like
# an iterable of two-item tuple or list
names = sorted([str(item[0]) for item in split_location])
# step two: write the list of sorted names
# you can write just one string by joining your
# list of names with newline characters
newf = values_dir / 'values.txt'
newf.write_text('\n'.join(names))
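As a side note: the edit shows that split_location comes from DataFrame.groupby, so (assuming that) the group keys can also be pulled straight from the GroupBy object without iterating the groups. A small sketch:

# split_location.groups is a dict-like mapping of group key -> row labels,
# so its keys are exactly the names that end up in values.txt
names = sorted(str(key) for key in split_location.groups)
newf.write_text('\n'.join(names))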
Related
I am comparing multiple CSV files against a master file by a selected column values, and want to keep only the file that has the most matches with the master file.
The code I have created gives me the results for each file, but I don't know how to make the comparison between the files themselves and just keep the one with the highest sum of matches at the end.
I know how to delete files via os.remove() and so on, but need help with the selection of the maximum value.
data0 = pd.read_csv('input_path/master_file.csv', sep=',')
csv_files = glob.glob(fr'path_to_files_in_comparison\**\*.csv', recursive=True)
for df in csv_files:
    df_base = os.path.basename(df)
    input_dir = os.path.dirname(df)
    data1 = pd.read_csv(df, sep=',')
    comp1 = pd.concat([data0, data1])[['values']]
    cnt1 = comp1.loc[comp1.duplicated()]
    match1 = cnt1.count(axis=1)
    sum = str(sum(match1))
    print('Matches between ' + df_base + ' & ' + input_dir + ': ' + sum)
The print gives me (paths and directories names appear correct):
Matches between ... & ...: 332215
Matches between ... & ...: 273239
I had the idea to try it via sub-lists, but just did not get anywhere.
You could write a function to calculate the "match score" for each file, and use that function as the key argument for the max function:
def match_score(csv_file):
    df_base = os.path.basename(csv_file)
    data1 = pd.read_csv(csv_file, sep=",")
    comp1 = pd.concat([data0, data1])[['values']]
    cnt1 = comp1.loc[comp1.duplicated()]
    match1 = cnt1.count(axis=1)
    return match1.sum()
Then,
csv_files = glob.glob(fr'path_to_files_in_comparison\**\*.csv', recursive=True)
max_match_file = max(csv_files, key=match_score)
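If the goal is then to keep only that best file, a minimal follow-up sketch (assuming, as the question says, the other files really should be deleted with os.remove):

# keep the file with the highest match score, remove the rest
for csv_file in csv_files:
    if csv_file != max_match_file:
        os.remove(csv_file)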
You can simplify your code a lot using pathlib.
Addressing your question, you can store the duplicates sum in a dictionary, and after comparing all files, choose the one with most matches. Something like this:
import pandas as pd
from pathlib import Path

main_file = Path('/main/path/main.csv')
main_df = pd.read_csv(main_file)

other_path = Path('/other/path/')
other_files = other_path.rglob('*.csv')

matches_per_file = {}
for other_file in other_files:
    other_df = pd.read_csv(other_file)
    merged_df = pd.concat([main_df, other_df])[['values']]
    dups = merged_df.loc[merged_df.duplicated()]
    dups_sum = sum(dups.count(axis=1))
    matches_per_file[other_file] = dups_sum
    print(f'Matches between {other_file} and {main_file}: {dups_sum}')

# find the file with most matches
most_matches = max(matches_per_file, key=matches_per_file.get)
The code above will populate matches_per_file with pairs filename: matches. That will make it easy for you to find the max(matches) and the corresponding filename, and then decide which files you will keep and which ones you will delete. The variable most_matches will be set with that filename.
Use the code snippet as a starting point, since I don't have the data files to test it properly.
Thank you for your support. I have built a solution using lists and sub-lists. I added the following to my code and it works. Probably not the nicest solution, but it gives me a chance to improve my Python skills.
liste1.append(df)
liste2.append(summe)

liste_overall = list(zip(liste1, liste2))
max_liste = max(liste_overall, key=lambda sublist: sublist[1])

for df2 in liste_overall:
    #print(max_liste)
    print(df2)
    if df2[1] in max_liste[1]:
        print("Maximum duplicated values, keep file!")
    else:
        print("Not maximum duplicated, file is removed!")
        os.remove(df2[0])
I am currently trying to do something very basic: compute the sum of two cells in a .csv file and output it into a new DataFrame. I then am repeating this for multiple rows in that .csv file, and multiple files in a folder. After all this, I am outputting the DataFrame to a .xlsx file. Main body of code is below:
for fname in glob.glob(path):
    print(fname)
    processed = []
    df = pd.read_csv(fname)
    for index, row in df.iterrows():
        processed.append(row['Rejected'] + row['Sorted'])
    heatMap[str(counter)] = processed
    counter += 1

newfname = 'Output.xlsx'
heatMap.to_excel(newfname)
However, when I look at my newly created DataFrame, the columns are out of order. Inspecting the console, I can see the files are iterated through in alphanumeric order.
Console output
I was wondering how my method can be adjusted so that I can iterate through the files in a natural sort order (1, 2, 3, 4, 5 etc.), so I don't have to change the name of each file.
Thank you!
for fname in sorted(glob.glob(path)):
    ...
This turns the glob result into a sorted list using Python's built-in sorted function, so you can then loop through it in alphabetical order.
For natural sort, there is a natsort package.
from natsort import natsorted
for fname in natsorted(glob.glob(path)):
    ...
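To see the difference, a tiny sketch with made-up file names:

from natsort import natsorted

files = ['run10.csv', 'run2.csv', 'run1.csv']
print(sorted(files))     # ['run1.csv', 'run10.csv', 'run2.csv'] -- lexicographic
print(natsorted(files))  # ['run1.csv', 'run2.csv', 'run10.csv'] -- natural order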
Python/Pandas beginner here. I have a list with names which each represent a csv file on my computer. I would like to create a separate pandas dataframe for each of these csv files and use the same names for the dataframes. I can do this in a very inefficient way by creating a separate line of code for each name in the list and adding/removing these lines of code manually as the list changes over time, something like this when I have 3 names Mark, Frank and Peter:
path = 'C:\\Users\\Me\\Desktop\\Names'
Mark = pd.read_csv(path+"Mark.csv")
Frank = pd.read_csv(path+"Frank.csv")
Peter = pd.read_csv(path+"Peter.csv")
Problem is that I will usually have a dozen or so names and they change frequently, so this is not very efficient. Instead I figured I would keep a list of the names to update when needed and use a for loop to do the rest:
path = 'C:\\Users\\Me\\Desktop\\Names'
names = ['Mark','Frank','Peter']
for name in names:
    name = pd.read_csv(path+name+'.csv')
This does not produce an error, but instead of creating 3 different dataframes Mark, Frank and Peter, it creates a single dataframe 'name' using only the data from the first entry in the list. How do I make this work so that it creates a separate dataframe for each name in the list and gives each dataframe the same name as the csv file that was read?
it creates a single dataframe 'name' using only the data from the first entry in the list.
It uses the last entry, because each time through the loop, name is replaced with the result of the next read_csv call. (Actually, it's being replaced with one of the values from the list, and then with the read_csv result; to avoid confusion, you should use separate names for your loop variable and your output. Especially since name doesn't make any sense as the thing to call your result :) )
How do I make this work
You had a list of input values, and thus you want a list of output values as well. The simplest approach is to use a list comprehension, describing the list you want in terms of the list you start with:
csvs = [
    pd.read_csv(f'{path}{name}.csv')
    for name in names
]
It works the same way as the explicit loop, except it builds a list automatically from the value that's computed each time through. It means what it says, in order: "csvs is a list of these pd.read_csv results, computed once for each of the name values that is in names".
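If you still want to look a frame up by its original name (an extra step, not required by the question), you could pair the two lists, e.g.:

# assumes names and csvs line up index-for-index
frames = dict(zip(names, csvs))
frames['Mark'].head()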
name here is just the variable used to iterate over the list; reassigning it inside the loop doesn't create a new variable for each entry.
path = 'C:\\Users\\Me\\Desktop\\Names'
names = ['Mark','Frank','Peter']

dfs = []
for name in names:
    dfs.append(pd.read_csv(path + name + '.csv'))

# OR

dfs = [
    pd.read_csv(path + name + '.csv')
    for name in names
]
Or, you can use a dict to map each name to its dataframe.
path = 'C:\\Users\\Me\\Desktop\\Names'
names = ['Mark','Frank','Peter']

dfs = {}
for name in names:
    dfs[name] = pd.read_csv(path + name + '.csv')

# OR

dfs = {
    name: pd.read_csv(path + name + '.csv')
    for name in names
}
Two options:
If you know the names of all your csv files, you can edit your code and just add a list to hold all your dataframes.
Example
path = 'C:\\Users\\Me\\Desktop\\Names'
names = ['Mark','Frank','Peter']

dfs = []
for name in names:
    dfs.append(pd.read_csv(path + name + '.csv'))
Otherwise, you can look for all the files with a csv extension and open all of them using os.listdir().
import os
import pandas as pd

path = 'C:\\Users\\Me\\Desktop\\Names'
files = os.listdir(path)

dfs = []
for file in files:
    if file[-3:] == "csv":
        dfs.append(pd.read_csv(path + file))
# this creates a module-level variable for each entry in names (Mark, Frank, Peter),
# although a dict as shown above is usually the cleaner option
for name in names:
    globals()[name] = pd.read_csv(path+name+'.csv')
I have a column of values, which are part of a dataframe df.
Value
6.868061881
6.5903628020000005
6.472865833999999
6.427754219
6.40081742
6.336348032
6.277545389
6.250755132
These values have been put together from several CSV files. Now I'm trying to backtrack and find the original CSV file which contains the values. This is my code. The problem is that each row of the CSV files contains alphanumeric entries and I'm comparing only numeric ones (the Values above), so the code isn't working.
for item in df['Value']:
    for file in dirs:
        csv_file = csv.reader(open(file))
        for row in csv_file:
            for column in row:
                if str(column) == str(item):
                    print(file)
Plus, I'm trying to reduce the number of loops. How do I approach this?
Assuming dirs is a list of file paths to CSV files:
csv_dfs = {file: pd.read_csv(file) for file in dirs}
csv_df = pd.concat(csv_dfs)
If the CSVs also have a 'Value' column matching the one in df, this is pretty straightforward:
print(csv_df[csv_df['Value'].isin(df['Value'])])
Because we made the dataframe from a dictionary of the files, where the keys are filenames, the printed values will have the original filename in the index.
In a comment, you asked how to just get the filenames. Because of the way we constructed the dataframe's index, the following should work to get a series of the filenames:
csv_df[csv_df['Value'].isin(df['Value'])].reset_index()['level_0']
Note, if you're not sure which column in the CSVs you're matching, then you can loop over them:
for col in csv_df.columns:
    print(csv_df[csv_df[col].isin(df['Value'])])
A few suggestions:
Make sure you're comparing like types, e.g.:
if str(column) == str(item):
Or, you could check types before doing the comparison:
if type(column) == type(item) and column == item:
Or, dump your CSV into a DataFrame. This approach reduces the number of loops since you don't need to iterate the rows/lines in the file, just the columns:
from pandas import read_csv

for item in df['Value']:
    for file in dirs:
        csv_frame = read_csv(file)
        for column in csv_frame.columns:
            # "in" on a Series checks the index, so compare against the values
            if item in csv_frame[column].values:
                print(file)
File I/O will generally take more time than processing data in memory. So, if you want to optimize your code, it is better to loop through the csv files once instead of once for every item in your dataframe. I suggest the following:
import numpy as np

val_list = df['Value'].values

for file in dirs:
    csv_df = pd.read_csv(file)
    df_contains = csv_df.isin(val_list)
    if np.any(df_contains.values):
        print(file)
I am trying to create a list of unique ID's from multiple csvs.
I have around 80 csvs containing data, all in the same format and in the same directory. The files contain time series data from around 1500 sites, but not all sites are in all files. The column with the data I need is called 'Site Id'.
I can get unique values from the first csv by creating a dataframe, but I can't see how to loop through all the remaining files.
If it's not obvious by now I am a complete beginner and my tutors are on vacation!
I've tried creating a df for a single file, but I can't figure out the next step.
df = pd.read_csv(r'C:filepathhere.csv')
ids = df['Site Id'].unique().tolist()
You can do something like this. I used the os.listdir function to get all of the files, and then list.extend to merge the site IDs I came across into my siteIDs list. Finally, turning the list into a set and then back into a list removes any duplicate entries.
siteIDs = []
directoryToCSVs = r'c:\...'

for filename in os.listdir(directoryToCSVs):
    if filename.lower().endswith('.csv'):
        df = pd.read_csv(os.path.join(directoryToCSVs, filename))
        siteIDs.extend(df['Site Id'].tolist())

# remove duplicate site IDs
siteIDs = list(set(siteIDs))
# siteIDs will now contain a list of the unique site IDs across all of your CSV files.
You could do something like this to iterate over all your CSVs and load them into dataframes:
from glob import glob
from os import path, walk
import pandas as pd

csv_dir = 'Path to CSV dir'

csv_paths = []
for root, dirs, files in walk(csv_dir):
    for c in glob(path.join(root, '*.csv')):
        csv_paths.append(c)

for file_path in csv_paths:
    df = pd.read_csv(filepath_or_buffer=file_path)
    # do something with df (append, export, etc.)
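For the actual goal in this question (unique 'Site Id' values across all files), a minimal sketch that builds on the loop above:

# collect the 'Site Id' column from every file into one set of unique values
site_ids = set()
for file_path in csv_paths:
    df = pd.read_csv(file_path)
    site_ids.update(df['Site Id'].dropna().unique())

unique_site_ids = sorted(site_ids)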
First you need to gather the files into a list that you will be getting data out of. There are many ways to do this; assuming you know the directory they are all in, see this answer for many options.
from os import walk

# mypath is the directory containing your CSV files
f = []
for (dirpath, dirnames, filenames) in walk(mypath):
    f.extend(filenames)
    break
Then, within that list, you'll need to gather the unique values that you need. Without using pandas, since it doesn't seem like you actually need your information in a dataframe:
import csv
from os import path

unique_data = {}
for file in f:
    with open(path.join(mypath, file), newline='') as infile:
        reader = csv.DictReader(infile)
        for row in reader:
            # go through each row, add every value to the dictionary
            for header, value in row.items():
                unique_data[value] = 0

# unique_data.keys() is now your list of unique values; if you want a true list:
unique_data_list = list(unique_data.keys())
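Since only the 'Site Id' column is actually needed here, a slightly narrower sketch of the same idea:

import csv
from os import path

site_ids = set()
for file in f:
    with open(path.join(mypath, file), newline='') as infile:
        reader = csv.DictReader(infile)
        for row in reader:
            # keep only the column the question cares about
            site_ids.add(row['Site Id'])

site_id_list = list(site_ids)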