Could have sworn that I was able to serialize and read back a dataframe with a column containing lists before.
When calling df.to_csv(path), the column containing lists does indeed hold lists.
When calling pd.read_csv(path), the column that previously contained lists now holds strings, but it needs to hold lists again.
I've written a converter argument to handle it, but I'd like to know if there is a better way. I've tried astype() with np.ndarray, list and 'O', with no luck.
Anyone know of a 'built-in' way of handling this?
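For context, a minimal sketch of the converter approach I mean, assuming a column named list_col (the name is just for illustration):

import ast
import pandas as pd

df = pd.DataFrame({'list_col': [[1, 2, 3], [4, 5]]})
df.to_csv('out.csv', index=False)

# The lists come back as strings like "[1, 2, 3]"; a converter parses them again.
df2 = pd.read_csv('out.csv', converters={'list_col': ast.literal_eval})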
My code below creates a dataframe from lists of columns from other dataframes. I'm getting an error when converting a set to a list. How can I treat that set as a list, in order to add those columns to my dataframe?
Error produced by +list(matchedList)
# extract the columns that need to conform
datasetMatched = dataset.select(selectedColumns +list(matchedList))
#display(datasetMatched)
TypeError: 'list' object is not callable
It probably happens because you're shadowing the built-in list function. Make sure you didn't define any variable named list in your code.
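A minimal sketch of how the error typically arises (the variable names are illustrative):

matchedList = {'colA', 'colB'}

list = ['some', 'columns']   # shadows the built-in list
list(matchedList)            # TypeError: 'list' object is not callable

del list                     # remove the shadowing name
list(matchedList)            # works again, returns the set's elements as a list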
I created a script to go over the needed data, using pandas.
I'm now receiving more files that I need to go over, and sadly these files do not have the same headers.
For example I have placed in my list of columns to use 'id_num' and in some of the files it appears as 'num_id'.
Is it possible to still use the usecols list I created, and allow certain elements in it to "connect" with different header strings, for example by using regex?
I assume you're referring to the usecols keyword in pd.read_csv (or an analogous pandas reader)? As you've probably gathered, pandas can't run a regex search over a dataframe before it has even read that dataframe, so I'm fairly certain a regex search via the usecols keyword isn't feasible.
However, after you read the csv into a dataframe (let's name it df for the sake of the example), you could very easily filter the columns of interest using regexes.
for example, suppose your new dataframe is loaded into df:
import re

potential_columns = ['num_id', 'id_num']
df_cols = [col for col in df.columns if re.search('|'.join(potential_columns), col)]
You can list all the potential columns you want to search for in potential_columns, then use join to build one combined regex pattern, and use a list comprehension to collect every matching column from df.columns. Once that's done you can finish the process by calling:
df = df[df_cols]
Dealing with duplicate columns and crafting smarter search patterns is left as an exercise for you; a consolidated sketch follows below.
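Going one step further, a rough sketch that also renames whichever id variant a given file uses back to a single canonical name (data.csv and the column names are only placeholders, and it assumes one variant appears per file):

import re
import pandas as pd

id_variants = ['num_id', 'id_num']
df = pd.read_csv('data.csv')

# Find whichever of the variant headers this particular file uses.
matched = [col for col in df.columns if re.search('|'.join(id_variants), col)]

# Map it back to one canonical name so downstream code always sees 'id_num'.
df = df.rename(columns={col: 'id_num' for col in matched})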
I'm tearing my hair out a bit with this one. I've imported two CSVs into pandas dataframes; both have a column called SiteReference, and I want to use pd.merge to join the dataframes using SiteReference as the key.
The initial merge failed because pd.read_csv interpreted the SiteReference values differently, in one instance 380500145.0 and in the other 380500145, both stored as objects. I ran a regex to clean the columns and then pd.to_numeric; this resulted in one value of 380500145.0 and another of 3.805001e+10. They should both be 380500145. I then attempted:
df['SiteReference'] = df['SiteReference'].astype(int).astype('str')
But got back:
ValueError: cannot convert float NaN to integer
How can I control how pandas deals with these, preferably on import?
Perhaps the best solution is to prevent pd.read_csv from inferring the type of this field:
df = pd.read_csv('data.csv', sep=',', dtype={'SiteReference': str})
Following the discussion in the comments, if you want to format floats as integer strings, you can use this:
df['SiteReference'] = df['SiteReference'].map('{:,.0f}'.format)
This should handle null values gracefully.
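To tie this back to the merge, a minimal sketch, assuming the two files are a.csv and b.csv (the filenames are only placeholders):

import pandas as pd

# Read SiteReference as plain strings in both files so pandas never
# reinterprets the values as floats.
df_a = pd.read_csv('a.csv', dtype={'SiteReference': str})
df_b = pd.read_csv('b.csv', dtype={'SiteReference': str})

# Strip a trailing '.0' in case one file stored the values as floats originally.
for df in (df_a, df_b):
    df['SiteReference'] = df['SiteReference'].str.replace(r'\.0$', '', regex=True)

merged = pd.merge(df_a, df_b, on='SiteReference')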
I have a folder that contains ~90 CSV files. Each relevant file is named xxxxx-2012 and has the same column names.
I would like to create a single DataFrame with a specific column power(MW) from each file, i.e. 90 columns in total, naming the column in the resulting DataFrame by the file name.
My objective with problems like this is to get to a simple data structure as quickly as possible. In this case, that could be a dictionary of filenames to DataFrames.
frames = {filename: pd.read_csv(filename) for filename in os.listdir()}
You may have to filter out bad filenames, e.g. by extension, or you may be better off using glob; either way it breaks up the problem, so this shouldn't be too bad.
Then the question becomes much easier*:
How do I get one column from a DataFrame? df[colname].
How do I concat a list of columns into a DataFrame?
*Assuming you know your way around Python data structures, e.g. list comprehensions.
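A rough sketch of those steps put together, assuming the files sit in the current directory and each has the power(MW) column described in the question:

import os
import pandas as pd

# One DataFrame per CSV file, keyed by filename.
frames = {fn: pd.read_csv(fn) for fn in os.listdir() if fn.endswith('.csv')}

# Pull the power(MW) column from each frame, name it after the file,
# and line the columns up side by side.
power = pd.concat({fn: df['power(MW)'] for fn, df in frames.items()}, axis=1)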
Another option is to just concat the entire dict:
pd.concat(frames)
(which gives you a MultiIndex with all the information.)
I am using pandas to read a csv file. The data are numbers but stored in the csv file as text. Some of the values are non-numeric when they are bad or missing. How do I filter out these values and convert the remaining data to integers?
I assume there is a better/faster way than looping over all the values and using isdigit() to test for them being numeric.
Does pandas or numpy have a way of just recognizing bad values in the reader? If not, what is the easiest way to do it? Do I have to specify the dtypes to make this work?
pandas.read_csv has the parameter na_values:
na_values : list-like, default None
List of additional strings to recognize as NA/NaN
where you can define these bad values.
You can pass a custom list of values to be treated as missing using pandas.read_csv (via na_values, as above). Alternatively, you can pass functions to the converters argument.
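A short sketch of both options, assuming a file data.csv with a column named value whose bad entries look like 'n/a' or 'bad' (filename, column name and markers are only illustrative):

import pandas as pd

# Option 1: declare the bad strings up front so they become NaN on read,
# then drop them and cast to int.
df = pd.read_csv('data.csv', na_values=['n/a', 'bad'])
df = df.dropna(subset=['value'])
df['value'] = df['value'].astype(int)

# Option 2: run every raw string through a converter while reading.
def to_int_or_none(raw):
    try:
        return int(raw)
    except ValueError:
        return None

df2 = pd.read_csv('data.csv', converters={'value': to_int_or_none})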
NumPy provides the function genfromtxt() specifically for this purpose. The first sentence from the linked documentation:
Load data from a text file, with missing values handled as specified.
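For example, a minimal genfromtxt sketch, assuming a comma-separated data.csv with a header row and 'n/a' as the bad-value marker (filename, marker and fill value are only placeholders):

import numpy as np

# Entries matching the marker are treated as missing and replaced by -1.
data = np.genfromtxt('data.csv', delimiter=',', names=True,
                     missing_values='n/a', filling_values=-1)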