I've been trying to find a good and flexible way to parse CSV files in Python but none of the standard options seem to fit the bill. I am tempted to write my own but I think that some combination of what exists in numpy/scipy and the csv module can do what I need, and so I don't want to reinvent the wheel.
I'd like the standard features: being able to specify the delimiter, whether or not there's a header, how many rows to skip, the comment character, which columns to ignore, etc. The central feature I am missing is being able to parse CSV files in a way that gracefully handles both string and numeric data. Many of my CSV files have columns that contain strings (not necessarily of the same length) alongside numeric data. I'd like to have numpy array functionality for the numeric data, but also be able to access the strings. For example, suppose my file looks like this (imagine the columns are tab-separated):
# my file
name favorite_integer favorite_float1 favorite_float2 short_description
johnny 5 60.2 0.52 johnny likes fruitflies
bob 1 17.52 0.001 bob, bobby, robert
and load it with something like:
data = loadcsv('myfile.csv', delimiter='\t', parse_header=True, comment='#')
I'd like to be able to access data in two ways:
As a matrix of values: it's important for me to get a numpy.array so that I can easily transpose and access the columns that are numeric. In this case, I want to be able to do something like:
floats_and_ints = data.matrix
floats_and_ints[:, 0] # access the integers
floats_and_ints[:, 1:3] # access some of the floats
transpose(floats_and_ints) # etc..
As a dictionary-like object where I don't have to know the order of the headers: I'd like to also access the data by the header order. For example, I'd like to do:
data['favorite_float1'] # get all the values of the column with header "favorite_float1"
data['name'] # get all the names of the rows
I don't want to have to know that favorite_float1 is the second column in the table, since this might change.
It's also important for me to be able to iterate through the rows and access the fields by name. For example:
for row in data:
    # print the name and favorite integer of every row
    print "Name: ", row["name"], row["favorite_integer"]
The representation in (1) suggests a numpy.array, but as far as I can tell this does not deal well with strings and requires me to specify the data type ahead of time, as well as the header labels.
The representation in (2) suggests a list of dictionaries, and this is what I have been using. However, this works poorly for CSV files that have a couple of string fields while the rest of the columns are numeric: for the numeric values you really do want to be able to get at the matrix representation sometimes and manipulate it as a numpy.array.
Is there a combination of csv/numpy/scipy features that allows the flexibility of both worlds? Any advice on this would be greatly appreciated.
In summary, the main features are:
Standard ability to specify delimiters, number of rows to skip, columns to ignore, etc.
The ability to get a numpy.array/matrix representation of the data so that its numeric values can be manipulated
The ability to extract columns and rows by header name (as in the above example)
Have a look at pandas, which is built on top of numpy.
Here is a small example:
In [7]: df = pd.read_csv('data.csv', sep='\t', index_col='name')
In [8]: df
Out[8]:
favorite_integer favorite_float1 favorite_float2 short_description
name
johnny 5 60.20 0.520 johnny likes fruitflies
bob 1 17.52 0.001 bob, bobby, robert
In [9]: df.describe()
Out[9]:
favorite_integer favorite_float1 favorite_float2
count 2.000000 2.000000 2.000000
mean 3.000000 38.860000 0.260500
std 2.828427 30.179317 0.366988
min 1.000000 17.520000 0.001000
25% 2.000000 28.190000 0.130750
50% 3.000000 38.860000 0.260500
75% 4.000000 49.530000 0.390250
max 5.000000 60.200000 0.520000
In [13]: df.loc['johnny', 'favorite_integer']
Out[13]: 5
In [15]: df['favorite_float1'] # or attribute: df.favorite_float1
Out[15]:
name
johnny 60.20
bob 17.52
Name: favorite_float1
In [16]: df['mean_favorite'] = df.mean(axis=1)
In [17]: df.iloc[:, 3:]
Out[17]:
short_description mean_favorite
name
johnny johnny likes fruitflies 21.906667
bob bob, bobby, robert 6.173667
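To cover the other two access patterns from the question (a plain numeric matrix and row-wise iteration by field name), something along these lines should work; the exact column list here is just an illustration:

# Numeric columns as a plain numpy array
floats_and_ints = df[['favorite_integer', 'favorite_float1', 'favorite_float2']].to_numpy()
floats_and_ints[:, 0]      # the integers
floats_and_ints[:, 1:3]    # the floats
floats_and_ints.T          # transpose

# Row-wise iteration with access by field name
for row in df.itertuples():
    print("Name:", row.Index, "favorite integer:", row.favorite_integer)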
matplotlib.mlab.csv2rec returns a numpy recarray, so you can do all the great numpy things to this that you would do with any numpy array. The individual rows, being record instances, can be indexed as tuples but also have attributes automatically named for the columns in your data:
rows = matplotlib.mlab.csv2rec('data.csv')
row = rows[0]
print row[0]
print row.name
print row['name']
csv2rec also understands "quoted strings", unlike numpy.genfromtxt.
In general, I find that csv2rec combines some of the best features of csv.reader and numpy.genfromtxt.
numpy.genfromtxt()
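For the example file above, a minimal sketch might look like this (skip_header=1 skips the '# my file' comment line, dtype=None lets genfromtxt guess each column's type and return a structured array you can index by field name, and encoding='utf-8' keeps the strings as str rather than bytes):

import numpy as np

data = np.genfromtxt('myfile.csv', delimiter='\t', skip_header=1,
                     names=True, dtype=None, encoding='utf-8')

data['favorite_float1']       # a column by header name
data['name']                  # the string column
data[0]['favorite_integer']   # a single field of the first row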
Why not just use the stdlib csv.DictReader?
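For instance, a small sketch; since the csv module always yields strings, any numeric conversion is left to you, and the comment line is filtered out by hand:

import csv

with open('myfile.csv', newline='') as f:
    # drop the '# my file' comment line before handing the file to DictReader
    lines = [line for line in f if not line.startswith('#')]

reader = csv.DictReader(lines, delimiter='\t')
data = list(reader)

data[0]['name']                                           # access fields by header name
favorites = [float(r['favorite_float1']) for r in data]   # numeric conversion by hand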
Important initial information: these values are IDs, not calculation results, so I really don't have a way to change how they are saved in the file.
Dataframe example:
datetime          match_name          match_id     runner_name  runner_id  ...
2022/01/01 10:10  City v Real Madrid  1.199632310  City         122.23450  ...
2021/01/01 01:01  Celtic v Rangers    1.23410      Rangers      101.870    ...
But in the DataFrame, match_id appears as:
1.19963231
1.2341
and runner_id appears as:
122.2345
101.87
I tried casting all values to string, so that the numbers would be treated as strings and the zeros would not be removed:
df = pd.read_csv(filial)
df = df.astype(str)
But it didn't help; the trailing zeros were still removed.
I am aware of float_format, but it requires specifying the number of decimal places to use, so I could not apply it here; and since these are IDs, I cannot risk a very long value being rounded.
Note: there are hundreds of different columns.
By the time your data is read, the zeros are already removed, so your conversion to str can no longer help.
You need to pass the option directly to read_csv():
df = pd.read_csv(filial, dtype={'runner_id': str})
If you have many columns like this, you can set dtype=str (instead of a dictionary), but then all your columns will be str, so you need to re-parse each of the interesting ones as their correct dtype (e.g. datetime).
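A minimal sketch of that, reusing the filial variable from the question and the datetime column from the example table (the date format string is just a guess from the sample):

import pandas as pd

# Read everything as string so no numeric parsing (and zero stripping) happens
df = pd.read_csv(filial, dtype=str)

# Re-parse only the columns that really need another dtype
df['datetime'] = pd.to_datetime(df['datetime'], format='%Y/%m/%d %H:%M')
# match_id and runner_id stay as strings, so 1.199632310 keeps its trailing zero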
More details are in the docs; maybe play with the converters parameter too.
I am trying to group by the items of a list in a DataFrame Series. The dataset being used is the Stack Overflow 2020 Survey.
The layout is roughly as follows:
... LanguageWorkedWith ... ConvertedComp ...
Respondent
1 Python;C 50000
2 C++;C 70000
I essentially want to use groupby on the unique values in the list of languages worked with, and apply a mean aggregator function to ConvertedComp, like so:
LanguageWorkedWith
C++ 70000
C 60000
Python 50000
I have actually managed to achieve the desired output, but my solution seems somewhat janky, and being new to Pandas I believe there is probably a better way.
My solution is as follows:
# read csv
sos = pd.read_csv("developer_survey_2020/survey_results_public.csv", index_col='Respondent')
# separate the string into a list of strings, disregarding unanswered responses
temp = sos["LanguageWorkedWith"].dropna().str.split(';')
# create new DataFrame with respondent index and rows populated with known languages
langs_known = pd.DataFrame(temp.tolist(), index=temp.index)
# stack columns as rows, dropping old column names
stacked_responses = langs_known.stack().reset_index(level=1, drop=True)
# Re-indexing sos DataFrame to match stacked_responses dimension
# Concatenate reindex series to ConvertedComp series columnwise
reindexed_pays = sos["ConvertedComp"].reindex(stacked_responses.index)
stacked_with_pay = pd.concat([stacked_responses, reindexed_pays], axis='columns')
# Remove rows with no salary data
# Renaming columns
stacked_with_pay.dropna(how='any', inplace=True)
stacked_with_pay.columns = ["LWW", "Salary"]
# Group by LWW and take the median
lang_ave_pay = stacked_with_pay.groupby("LWW")["Salary"].median().sort_values(ascending=False).head()
Output:
LWW
Perl 76131.5
Scala 75669.0
Go 74034.0
Rust 74000.0
Ruby 71093.0
Name: Salary, dtype: float64
which matches the value calculated when choosing a specific language: sos.loc[sos["LanguageWorkedWith"].str.contains('Perl').fillna(False), "ConvertedComp"].median()
Any tips on how to improve/functions that provide this functionality/etc would be appreciated!
Take only the target columns into a DataFrame, split out the language names, and combine them with the salary. The next step is to convert the data from wide to long format using melt. Then group by language name and take the median. See the melt docs.
# keep only the two relevant columns
lww = sos[["LanguageWorkedWith", "ConvertedComp"]]
# split the languages into separate columns alongside the salary
lwws = pd.concat([lww['ConvertedComp'], lww['LanguageWorkedWith'].str.split(';', expand=True)], axis=1)
lwws.reset_index(drop=True, inplace=True)
# reshape from wide to long: one row per (salary, language) pair
df_long = pd.melt(lwws, id_vars='ConvertedComp', value_vars=lwws.columns[1:], var_name='lang', value_name='lang_name')
df_long.groupby('lang_name')['ConvertedComp'].median().sort_values(ascending=False).head()
lang_name
Perl 76131.5
Scala 75669.0
Go 74034.0
Rust 74000.0
Ruby 71093.0
Name: ConvertedComp, dtype: float64
I am building a shallow array in pandas containing pairs of values (concept, document):
          doc1  doc2
concept1     1     0
concept2     0     1
concept3     1     0
I parse an XML file and get (concept, doc) pairs; every time a new pair comes in I add it to the DataFrame.
Since an incoming pair may or may not contain values already present in the rows and/or columns (a brand-new concept or a brand-new document), I use the following code:
onp=np.arange(1,21,1).reshape(4,5)
oindex=['concept1','concept2','concept3','concept4',]
ohead=['doc1','doc2','doc3','doc5','doc6']
data=onp
mydf=pd.DataFrame(data,index=oindex, columns=ohead)
#... loop ...
mydf.loc['conceptXX','ep8']=1
It works well, except that the value stored in the data frame is 1.0 and not an integer/boolean 1, and when a new row and/or column is added the rest of its values are NaN. How can I avoid that? All the added values should be 0 or 1. Note: the intention is to also have some columns for calculations, so I cannot simply convert the whole dataframe to another type, for instance:
mydf=mydf.astype(object)
thanks.
SECOND EDIT AFTER ALollz COMMENT
More explanation of the real problem.
I have an XML file that gives me the data in the following way:
<names>
  <name>michael
    <documents>
      <document>doc1</document>
      <document>doc2</document>
    </documents>
  </name>
  <name>mathieu
    <documents>
      <document>doc1</document>
      <document>docN</document>
    </documents>
  </name>
</names>
...
I want to pass this data to a dataframe to make calculations. Basically there are names that appear in different documents; I parse the XML with:
tree = ET.parse(myinputFile)
root = tree.getroot()
I am adding the new values into the dataframe one by one.
Sometimes the name is already present in the dataframe but a new doc has to be added, and vice versa.
I hope that clarifies it a bit.
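A minimal sketch of that loop, assuming the XML structure above; myinputFile is the variable from the snippet and the element paths are taken from the example:

import xml.etree.ElementTree as ET
import pandas as pd

tree = ET.parse(myinputFile)
root = tree.getroot()

mydf = pd.DataFrame()   # rows will be names, columns will be documents

for name_el in root.findall('name'):
    person = name_el.text.strip()
    for doc_el in name_el.findall('documents/document'):
        # .loc creates the row and/or column on the fly if it does not exist yet
        mydf.loc[person, doc_el.text] = 1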
I was about to write this as a solution:
mydf.fillna(0, inplace=True)
mydf=mydf.astype(int)
changing all the NaN values to 0 and then converting them to int to avoid floats.
That has a downside, though: I might want some columns with text data, and in that case an error occurs.
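One way around that, as a rough sketch, is to restrict the fill-and-cast to the document columns only; this assumes they can be picked out by name (here, that they all start with 'doc'):

# only fill and cast the document indicator columns, leaving text columns untouched
doc_cols = [c for c in mydf.columns if c.startswith('doc')]
mydf[doc_cols] = mydf[doc_cols].fillna(0).astype(int)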
I have a pandas dataframe with a pipe delimited column with an arbitrary number of elements, called Parts. The number of elements in these pipe-strings varies from 0 to over 10. The number of unique elements contained in all pipe-strings is not much smaller than the number of rows (which makes it impossible for me to manually specify all of them while creating new columns).
For each element of the pipe-delimited list, I want to create a new column that acts as an indicator variable. For instance, the row
...'Parts'...
...'12|34|56'
should be transformed to
...'Part_12' 'Part_34' 'Part_56'...
...1 1 1...
Because there are a lot of unique parts, these columns are obviously going to be sparse: mostly zeros, since each row only contains a small fraction of the unique parts.
I haven't found any approach that doesn't require manually specifying the columns (for instance, Pandas Dataframe: split column into multiple columns, right-align inconsistent cell entries).
I've also looked at pandas' melt, but I don't think that's the appropriate tool.
The way I know how to solve it would be to pipe the raw CSV to another python script and deal with it on a char-by-char basis, but I need to work within my existing script since I will be processing hundreds of CSVs in this manner.
Here's a better illustration of the data
ID    YEAR  AMT      PARTZ
1202  2007  99.34
9321  1988  1012.99  2031|8942
2342  2012  381.22   1939|8321|Amx3
You can use get_dummies and add_prefix:
df.Parts.str.get_dummies().add_prefix('Part_')
Output:
Part_12 Part_34 Part_56
0 1 1 1
Edit for the comment about counting duplicates:
df = pd.DataFrame({'Parts':['12|34|56|12']}, index=[0])
pd.get_dummies(df.Parts.str.split('|', expand=True).stack()).groupby(level=0).sum().add_prefix('Part_')
Output:
Part_12 Part_34 Part_56
0 2 1 1
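To attach the indicator columns to the original frame (using the PARTZ column name from the illustration above; rows with an empty PARTZ simply get all zeros), something like this should do:

df = df.join(df['PARTZ'].str.get_dummies().add_prefix('Part_'))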
I have around 100 csv files. Each of them is read into its own pandas dataframe, they are merged later on, and the result is finally written into a database.
Each csv file contains 1000 rows and 816 columns.
Here is the problem:
Each of the csv files contains the 816 columns, but not all of the columns contain data. As a result, some of the csv files are misaligned: the data has been shifted left, but the column has not been deleted.
Here's a made-up example:
CSV file A (which is correct):
Name Age City
Joe 18 London
Kate 19 Berlin
Math 20 Paris
CSV file B (with misalignment):
Name Age City
Joe 18 London
Kate Berlin
Math 20 Paris
I would like to merge A and B, but my current solution results in a misalignment.
I'm not sure whether this is easier to deal with in SQL or Python, but I hoped some of you could come up with a good solution.
The current solution to merge the dataframes is as follows:
def merge_pandas(csvpaths):
    frames = []
    for path in csvpaths:
        frame = pd.read_csv(sMainPath + path, header=0, index_col=None)
        frames.append(frame)
    return pd.concat(frames)
Thanks in advance.
A generic solution for these types of problems is most likely overkill. We note that the only possible mistake is when a value is written into a column to the left of where it belongs.
If your problem is more complex than the small example you gave, you should keep an array that contains the expected type of each column, for your convenience.
types = ['string', 'int']
Next, I would set up a marker to identify flaws:
df['error'] = 0
df.loc[df.City.isnull(), 'error'] = 1
The script can detect the error with certainty
In your simple scenario, whenever there is an error, we can simply check the value in the first column.
If it's a number, ignore and move on (keep NaN on second value)
If it's a string, move it to the right
In your trivial example, that would be:
def checkRow(row):
    try:
        row['Age'] = int(row['Age'])
    except ValueError:
        row['City'] = row['Age']
        row['Age'] = np.nan
    return row

df = df.apply(checkRow, axis=1)
In case you have more than two columns, use your types variable to do iterated checks to find out where the NaN belongs.
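A rough sketch of such a generalized check, assuming values only ever slide left and that the types list holds one callable per column (a single right-to-left pass fixes one-slot shifts; re-applying it would handle larger ones):

import numpy as np
import pandas as pd

types = [str, int, str]   # expected type per column, e.g. Name, Age, City

def check_row(row):
    cols = list(row.index)
    for i in range(len(types) - 2, -1, -1):      # walk the typed columns right to left
        val, expected = row[cols[i]], types[i]
        if pd.isna(val):
            continue
        try:
            expected(val)                         # does the value fit this column's type?
        except (ValueError, TypeError):
            if pd.isna(row[cols[i + 1]]):         # there is a free slot to the right
                row[cols[i + 1]] = val
                row[cols[i]] = np.nan
    return row

df = df.apply(check_row, axis=1)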
The script cannot know the error with certainty
For example, if two adjacent columns both hold string values, you're screwed: use a second marker to flag these rows and fix them manually. You could of course do more advanced checks (if the value should be a city name, check whether it actually is one), but that is probably overkill, and doing it manually is faster.