Finding a specific value in CSV files with Python

I have a column of values, which are part of a dataframe df.
Value
6.868061881
6.5903628020000005
6.472865833999999
6.427754219
6.40081742
6.336348032
6.277545389
6.250755132
These values have been put together from several CSV files. Now I'm trying to backtrack and find the original CSV file which contains the values. This is my code. The problem is that each row of the CSV files contains alphanumeric entries, while I'm only comparing the numeric ones (like the values above), so the code isn't working.
import csv

for item in df['Value']:
    for file in dirs:
        csv_file = csv.reader(open(file))
        for row in csv_file:
            for column in row:
                if str(column) == str(item):
                    print(file)
Plus, I'm trying to optimize the number of loops. How do I approach this?

Assuming dirs is a list of file paths to CSV files:
import pandas as pd

csv_dfs = {file: pd.read_csv(file) for file in dirs}
csv_df = pd.concat(csv_dfs)
If you're just looking in the 'Value' column, this is pretty straightforward:
print(csv_df[csv_df['Value'].isin(df['Value'])])
Because we made the dataframe from a dictionary of the files, where the keys are filenames, the printed values will have the original filename in the index.
In a comment, you asked how to just get the filenames. Because of the way we constructed the dataframe's index, the following should work to get a series of the filenames:
csv_df[csv_df['Value'].isin(df['Value'])].reset_index()['level_0']
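If you want the filenames deduplicated as a plain list, here is a minimal follow-up sketch, assuming csv_df was built with pd.concat on a {filename: dataframe} dict as above:
matches = csv_df[csv_df['Value'].isin(df['Value'])]
# level 0 of the resulting MultiIndex holds the originating filename
matching_files = matches.index.get_level_values(0).unique().tolist()
print(matching_files)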
Note, if you're not sure what column in the CSVs you're matching, then you can loop it:
for col in csv_df.columns:
    print(csv_df[csv_df[col].isin(df['Value'])])

A few suggestions:
Make sure you're comparing like types, e.g.:
if str(column) == str(item):
Or, you could check types before doing the comparison:
if type(column) == type(item) and column == item:
Or, dump your CSV into a DataFrame. This approach reduces the number of loops since you don't need to iterate the rows/lines in the file, just the columns:
from pandas import read_csv

for item in df['Value']:
    for file in dirs:
        csv_frame = read_csv(file)
        for column in csv_frame.columns:
            # .values so the membership test checks the data, not the Series index
            if item in csv_frame[column].values:
                print(file)

File I/O will generally take more time than processing data in memory. So, if you want to optimize your code, it is better to loop through the CSV files once, instead of once for every item in your dataframe. I suggest the following:
import numpy as np
import pandas as pd

val_list = df['Value'].values
for file in dirs:
    csv_df = pd.read_csv(file)
    df_contains = csv_df.isin(val_list)
    if np.any(df_contains.values):
        print(file)
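As a small follow-up (a hedged variation, not part of the answer above), you can collect the matching filenames instead of printing them, and drop the numpy import by using pandas' own any():
matching_files = []
for file in dirs:
    csv_df = pd.read_csv(file)
    # .any().any() reduces the boolean frame to a single True/False
    if csv_df.isin(val_list).any().any():
        matching_files.append(file)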

Related

Reading, calculate and group data of several files with pandas

I'm trying to make a small script to automate something at my work. I have a ton of text files that I need to group into a large dataframe to plot afterwards.
The files have a general structure like this:
5.013130280 4258.0
5.039390845 4198.0
... ...
49.944957015 858.0
49.971217580 833.0
What I want to do is:
1. Keep the first column as the first column of the final dataframe (these values are the same for all files).
2. For the rest of the dataframe, extract the second column of each file, normalize it, and group everything together.
3. Use the file name as the header for the extracted column (from point 2), to use later when plotting the data.
Right now I was only able to do step 2; here is the code:
import os
import glob
import pandas as pd

path = "mypath"
extension = 'xy'
os.chdir(path)
dir = os.listdir(path)
files = glob.glob(path + "/*.xy")
li_norm = []
for file in files:
    df = pd.read_csv(file, names=('angle', 'int'), delim_whitespace=True)
    df['int_n'] = df['int'] / df['int'].max()
    li_norm.append(df['int_n'])
norm_files = pd.concat(li_norm, axis=1)
So is there any way to solve this in an easy way?
Assuming that all of your files have exactly the same length (number of rows) and the same angle values, you don't really need to make a bunch of dataframes and concatenate them all together.
If I'm understanding correctly, you just want a final dataframe with a new column for each file (named with the filename) containing that file's 'int' data, normalized using only the values from that specific file.
On the first file, you can create a dataframe to use as your final output, then just add columns to it for each subsequent file:
for idx, file in enumerate(files):
    df = pd.read_csv(file, names=('angle', 'int'), delim_whitespace=True)
    filename = file.split('\\')[-1][:-3]  # get filename by splitting the full path and removing the last 3 characters (file extension)
    df[filename] = df['int'] / df['int'].max()  # use the filename itself as the new column name
    if idx == 0:  # create the norm_files output dataframe on the first file
        norm_files = df[['angle', filename]]
    else:  # add a column to norm_files for each subsequent file
        norm_files[filename] = df[filename]
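As a hedged usage sketch of the result (matplotlib and the column names above are assumed), plotting every normalized column against the shared angle column might look like:
import matplotlib.pyplot as plt

# plot each normalized intensity column against 'angle'
norm_files.plot(x='angle')
plt.ylabel('normalized intensity')
plt.show()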
You can add a calculated column quite simply, although I'm not sure if that's what you're asking.
li_norm = []
for file in files:
    df = pd.read_csv(file, names=('angle', 'int'), delim_whitespace=True)
    col_name = file.split('.')[0]
    df[col_name] = df['int'] / df['int'].max()
    li_norm.append(df[col_name])

How to create a dataframe from multiple csv files?

I am loading a csv file in pandas as
premier10 = pd.read_csv('./premier_league/pl_09_10.csv')
However, I have 20+ csv files, which I was hoping to load as separate dfs (one df per csv) using a loop and predefined names, something similar to:
import pandas as pd
file_names = ['pl_09_10.csv','pl_10_11.csv']
names = ['premier10','premier11']
for i in range(0, len(file_names)):
    names[i] = pd.read_csv('./premier_league/{}'.format(file_names[i]))
(Note, here I provide only two csv files as an example.) Unfortunately, this doesn't work (no error messages, but the pd dfs don't exist).
Any tips/links to previous questions would be greatly appreciated as I haven't found anything similar on Stackoverflow.
Use pathlib to set a Path, p, to the files
Use the .glob method to find the files matching the pattern
Create a dataframe with pandas.read_csv
Use a dict comprehension to create a dict of dataframes, where each file will have its own key-value pair.
Use the dict like any other dict; the keys are the file names and the values are the dataframes.
Alternatively, use a list comprehension with pandas.concat to create a single dataframe from all the files.
In the for-loop in the OP, variables can't be created that way: assigning to names[i] only replaces the string at position i of the list with a dataframe, it does not create a variable called premier10.
It is effectively trying to do 'premier10' = pd.read_csv(...), where 'premier10' is a str type, not a variable name.
from pathlib import Path
import pandas as pd
# set the path to the files
p = Path('some_path/premier_league')
# create a list of the files matching the pattern
files = list(p.glob('pl_*.csv'))
# creates a dict of dataframes, where each file has a separate dataframe
df_dict = {f.stem: pd.read_csv(f) for f in files}
# alternative, creates 1 dataframe from all files
df = pd.concat([pd.read_csv(f) for f in files])
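A short usage sketch (the key name is an assumption based on the file names in the question; p.glob with the pattern above would give 'pl_09_10' as the stem of pl_09_10.csv):
# access one season's dataframe by its file stem
premier10 = df_dict['pl_09_10']
print(premier10.head())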
names = ['premier10','premier11'] does not create a dictionary but a list, so names[i] = pd.read_csv(...) only replaces the string at position i. Simply replace it with names = dict() and assign with a key instead, e.g. names[file_names[i]] = pd.read_csv(...).
This is what you want:
import os
import pandas as pd

# create a variable and look through contents of the directory
files = [f for f in os.listdir("./your_directory") if f.endswith('.csv')]
# initialize an empty data frame
all_data = pd.DataFrame()
# iterate through files and their contents, then concatenate their data into the data frame initialized above
for file in files:
    df = pd.read_csv('./your_directory/' + file)
    all_data = pd.concat([all_data, df])
# call the new data frame and verify that contents were transferred
all_data.head()

How to select unique values from named column in multiple .csv files?

I am trying to create a list of unique IDs from multiple CSVs.
I have around 80 csvs containing data, all in the same format and in the same directory. The files contain time series data from around 1500 sites, but not all sites are in all files. The column with the data I need is called 'Site Id'.
I can get unique values from the first csv by creating a dataframe, but I can't see how to loop through all the remaining files.
If it's not obvious by now I am a complete beginner and my tutors are on vacation!
I've tried creating a df for a single file, but I can't figure out the next step.
df = pd.read_csv(r'C:filepathhere.csv')
ids = df['Site Id'].unique().tolist()
You can do something like this. I used the os.listdir function to get all of the files, and then the list.extend to merge the site IDs I was coming across into my siteIDs list. Finally, turning a list into a set, and then back into a list will remove any duplicate entries.
import os
import pandas as pd

siteIDs = []
directoryToCSVs = r'c:\...'
for filename in os.listdir(directoryToCSVs):
    if filename.lower().endswith('.csv'):
        df = pd.read_csv(os.path.join(directoryToCSVs, filename))
        siteIDs.extend(df['Site Id'].tolist())

# remove duplicate site IDs
siteIDs = list(set(siteIDs))
# siteIDs will now contain a list of the unique site IDs across all of your CSV files.
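If you then want to persist that list, here is an optional hedged follow-up sketch (the output filename is arbitrary, and pandas is assumed to be imported as pd):
# sort the IDs and write them out for later use
siteIDs.sort()
pd.Series(siteIDs, name='Site Id').to_csv('unique_site_ids.csv', index=False, header=True)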
You could do something like this to iterate over all your CSVs and load them into dataframes:
from glob import glob
from os import walk
from os.path import join

import pandas as pd

csv_dir = 'Path to CSV dir'
csv_paths = []
for root, dirs, files in walk(csv_dir):
    for c in glob(join(root, '*.csv')):
        csv_paths.append(c)

for file_path in csv_paths:
    df = pd.read_csv(filepath_or_buffer=file_path)
    # do something with df (append, export, etc.)
First you need to gather the files into a list that you will be getting data out of. There are many ways to do this; assuming you know the directory they are all in, see this answer for many options.
from os import walk

f = []
for (dirpath, dirnames, filenames) in walk(mypath):
    f.extend(filenames)
    break
Then within that list you'll need to gather those unique values that you need. Without using Pandas, since it doesn't seem like you actually need your information in a dataframe:
import csv

unique_data = {}
for file in f:
    with open(file, 'r', newline='') as infile:
        reader = csv.DictReader(infile)
        for row in reader:
            # go through each column and add its value to the dictionary
            for header, value in row.items():
                unique_data[value] = 0

# unique_data.keys() is now your list of unique values; if you want a true list:
unique_data_list = list(unique_data.keys())
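If you only need the 'Site Id' column specifically (a hedged tweak of the snippet above, assuming that header is present in every file), the inner loop can be narrowed to that one field:
import csv

unique_site_ids = set()
for file in f:
    with open(file, 'r', newline='') as infile:
        reader = csv.DictReader(infile)
        for row in reader:
            # only collect the column the question cares about
            unique_site_ids.add(row['Site Id'])

unique_site_ids_list = list(unique_site_ids)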

Read in multiple csv into separate dataframes in Pandas

I have a long list of csv files that I want to read as dataframes and name them by their file name. For example, I want to read in the file status.csv and assign its dataframe the name status. Is there a way I can efficiently do this using Pandas?
Looking at this, I still have to write the name of each csv in my loop. I want to avoid that.
Looking at this, it reads multiple CSVs into one dataframe, instead of the many separate ones I want.
You can list all csv under a directory using os.listdir(dirname) and combine it with os.path.basename to parse the file name.
import os
import pandas as pd

# current directory csv files
csvs = [x for x in os.listdir('.') if x.endswith('.csv')]
# status.csv -> status
fns = [os.path.splitext(os.path.basename(x))[0] for x in csvs]

d = {}
for i in range(len(fns)):
    d[fns[i]] = pd.read_csv(csvs[i])
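Usage then mirrors what the question asked for (the key name is assumed from the example file status.csv):
status = d['status']  # the dataframe read from status.csv
print(status.head())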
You could create a dictionary of DataFrames:
d = {}  # dictionary that will hold them
for file_name in list_of_csvs:  # loop over files
    # read the csv into a dataframe and add it to the dict with file_name as its key
    d[file_name] = pd.read_csv(file_name)

Why is my for loop overwriting instead of appending?

I have multiple (25k) .csv files that I'm trying to append into an HDFStore file. They all share identical headers. I am using the code below, but for some reason the resulting dataframe doesn't contain all of the files; it only contains the last file in the list.
filenames = []  # list of .csv file paths that I've already populated
dtypes = {dict of datatypes}
store = pd.HDFStore('store.h5')
# store one data frame
store.put('df', pd.read_csv(filenames[0], dtype=dtypes, parse_dates=["date"]))
for f in filenames:
    try:
        temp_csv = pd.DataFrame()
        temp_csv = pd.read_csv(f, dtype=dtypes, parse_dates=["trade_date"])
        store.append('df', temp_csv)
    except:
        pass
I've tried using a subset of the filenames list, but always get the last entry. For some reason, the loop is not appending my file, but rather overwriting it every single time. Any advice would be appreciated as this is driving me bonkers. (python 3, windows)
I think the problem is related to:
store.append('df', temp_csv)
If I understand correctly what you're trying to do, 'df' should change every iteration; right now you're just overwriting it.
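A minimal sketch of that suggestion, assuming you really do want one key per file rather than one combined frame:
# store each CSV under its own key instead of reusing 'df'
for i, f in enumerate(filenames):
    temp_csv = pd.read_csv(f, dtype=dtypes, parse_dates=["trade_date"])
    store.put('df_{}'.format(i), temp_csv)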
You're creating/storing a new DataFrame with each iteration, like #SeaMonkey said. Your consolidated dataframe should be instantiated outside your loop, something like this.
filenames = []  # list of .csv file paths that I've already populated
dtypes = {dict of datatypes}
df = pd.DataFrame()
for f in filenames:
    df_tmp = pd.read_csv(f, dtype=dtypes, parse_dates=["trade_date"])
    df = df.append(df_tmp)  # on pandas >= 2.0, DataFrame.append is removed; use df = pd.concat([df, df_tmp]) instead
store = pd.HDFStore('store.h5')
store.put('df', df)
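A brief hedged usage note: reading the consolidated frame back out of the store later would then be just:
store = pd.HDFStore('store.h5')
df_all = store['df']  # or store.get('df')
store.close()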
