converting column names to integer with read_csv - python

I have constructed a matrix with integer values for columns and index. The matrix is actually hierarchical, with one level for each month. My problem is that indexing and selecting data no longer works as before once I write the data to CSV and then load it back as a pandas DataFrame.
Selecting data before writing and reading data to file:
matrix.ix[1][4][3] would for example give 123
In words: select month January and get the (travel) flow from origin 4 to destination 3.
After writing the data to CSV and reading it back into pandas, the original referencing fails, but it works if I convert the column label to a string:
matrix.ix[1]['4'][3]
... the column names have automatically been transformed from integer to string, but I would prefer the original indexing.
Any suggestions?
My current quick fix for handling the data after loading from csv is:
# Writing df to file
mulitindex_df_Travel_monthly.to_csv(r'result/Final_monthly_FlightData_countrylevel_v4.csv')

# Loading df from csv
test_matrix = pd.read_csv(filepath_inputdata + '/Final_monthly_FlightData_countrylevel_v4.csv',
                          index_col=[0, 1])
test_matrix.rename(columns=int, inplace=True)  # Thx, @ayhan
CSV FILE:
https://www.dropbox.com/s/4u2opzh65zwcn81/travel_matrix_SO.csv?dl=0

I used something like this:
df = df.rename(columns={str(c): c for c in columns})
where:
df is the pandas DataFrame and columns is the list of columns to change

You could also do
df.columns = df.columns.astype(int)
or
df.columns = df.columns.map(int)
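For a concrete picture of why this happens, here is a minimal sketch of the round trip (the file name and toy values are made up): CSV has no notion of column types, so the integer labels come back as strings, and casting them back restores integer-based selection.
import pandas as pd

df = pd.DataFrame([[123, 456]], columns=[3, 4],
                  index=pd.MultiIndex.from_tuples([(1, 4)], names=['month', 'origin']))
df.to_csv('matrix.csv')

loaded = pd.read_csv('matrix.csv', index_col=[0, 1])
print(loaded.columns.tolist())                # ['3', '4'] -- labels are now strings

loaded.columns = loaded.columns.astype(int)   # or loaded.columns.map(int)
print(loaded.loc[1].loc[4, 3])                # 123 -- integer selection works again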
Related: what is difference between .map(str) and .astype(str) in dataframe

Related

Pandas read_csv truncating 0s in zip code [duplicate]

I am importing study data into a Pandas data frame using read_csv.
My subject codes are 6 numbers coding, among others, the day of birth. For some of my subjects this results in a code with a leading zero (e.g. "010816").
When I import into Pandas, the leading zero is stripped off and the column is formatted as int64.
Is there a way to import this column unchanged maybe as a string?
I tried using a custom converter for the column, but it does not work - it seems as if the custom conversion takes place before Pandas converts to int.
As indicated in this answer by Lev Landau, there is a simple solution: use the converters option for that column in the read_csv function.
converters={'column_name': str}
Let's say I have csv file projects.csv like below:
project_name,project_id
Some Project,000245
Another Project,000478
For example, the code below trims the leading zeros:
from pandas import read_csv
dataframe = read_csv('projects.csv')
print(dataframe)
Result:
      project_name  project_id
0     Some Project         245
1  Another Project         478
Solution code example:
from pandas import read_csv
dataframe = read_csv('projects.csv', converters={'project_id': str})
print(dataframe)
Required result:
      project_name project_id
0     Some Project     000245
1  Another Project     000478
To have all columns as str:
pd.read_csv('sample.csv', dtype=str)
To have certain columns as str:
# column names which need to be string
lst_str_cols = ['prefix', 'serial']
dict_dtypes = {x: 'str' for x in lst_str_cols}
pd.read_csv('sample.csv', dtype=dict_dtypes)
Here is a shorter, robust and fully working solution:
Simply define a mapping (dictionary) from column names to the desired data types:
dtype_dic = {'subject_id': str,
             'subject_number': 'float'}
Then use that mapping with pd.read_csv():
df = pd.read_csv(yourdata, dtype=dtype_dic)
et voila!
If you have a lot of columns and you don't know which ones contain leading zeros that might be missed, or if you just need to automate your code, you can do the following:
df = pd.read_csv("your_file.csv", nrows=1) # Just take the first row to extract the columns' names
col_str_dic = {column:str for column in list(df)}
df = pd.read_csv("your_file.csv", dtype=col_str_dic) # Now you can read the compete file
You could also do:
df = pd.read_csv("your_file.csv", dtype=str)
By doing this you will have all your columns as strings and you won't lose any leading zeros.
You can do this; it works on all versions of pandas:
pd.read_csv('filename.csv', dtype={'zero_column_name': object})
You can use converters to pad the number to a fixed width if you know the width.
For example, if the width is 5, then
data = pd.read_csv('text.csv', converters={'column1': lambda x: f"{x:05}"})
This will do the trick. It works for pandas==0.23.0 and also read_excel.
Python 3.6 or higher is required.
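Since read_csv converters receive each field as the raw string, a zero-fill variant of the same idea (using the same hypothetical column name) is:
import pandas as pd

# pad the raw string on the left with zeros up to width 5
data = pd.read_csv('text.csv', converters={'column1': lambda x: str(x).zfill(5)})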
I don't think you can specify a column type the way you want (if there haven't been changes recently, and if the 6-digit number is not a date that you can convert to datetime). You could try using np.genfromtxt() and create the DataFrame from there.
EDIT: Take a look at Wes McKinney's blog, there might be something for you. It seems that a new parser is coming in pandas 0.10 in November.
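If you do want to try the np.genfromtxt() route, a rough sketch using the projects.csv layout from earlier (this is just one way to wire it up, not the only one):
import numpy as np
import pandas as pd

# Read every field as text so the leading zeros survive, skipping the header row
raw = np.genfromtxt('projects.csv', delimiter=',', dtype=str,
                    skip_header=1, encoding='utf-8')
df = pd.DataFrame(raw, columns=['project_name', 'project_id'])
print(df)   # project_id keeps '000245' and '000478'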
As an example, consider the following my_data.txt file:
id,A
03,5
04,6
To preserve the leading zeros for the id column:
df = pd.read_csv("my_data.txt", dtype={"id":"string"})
df
   id  A
0  03  5
1  04  6

Got DatetimeIndex. Now how do I get these records to excel?

Goal:
From an excel file, I want to get all the records which have dates that fall within a range and write them to a new excel file. The infile I'm working with has 500K+ rows and 21 columns.
What I've tried:
I've read the infile into a Pandas dataframe and then created the DatetimeIndex. If I print the range variable I get the desired dates.
import pandas as pd
in_excel_file = r'path\to\infile.xlsx'
out_excel_file = r'path\to\outfile.xlsx'
df = pd.read_excel(in_excel_file)
range = (pd.date_range(start='1910-1-1', end='2021-1-1'))
print(range)
##prints
DatetimeIndex(['1990-01-01', '1990-01-02', '1990-01-03', '1990-01-04',
               '1990-01-05', '1990-01-06', '1990-01-07', '1990-01-08',
               '1990-01-09', '1990-01-10',
               ...
               '2020-12-23', '2020-12-24', '2020-12-25', '2020-12-26',
               '2020-12-27', '2020-12-28', '2020-12-29', '2020-12-30',
               '2020-12-31', '2021-01-01'],
              dtype='datetime64[ns]', length=11324, freq='D')
Where I'm having trouble is getting the above DatetimeIndex to the outfile. The following gives me an error:
range.to_excel(out_excel_file, index=False)
AttributeError: 'DatetimeIndex' object has no attribute 'to_excel'
I'm pretty sure that when writing to excel it has to be a dataframe. So, my question is how do I get the range variable to a dataframe object?
Goal: From an excel file, I want to get all the records which have dates that fall within a range and write them to a new excel file. The infile I'm working with has 500K+ rows and 21 columns.
You could use an indexing operation to select only the data you need from the original DataFrame and save the result in an Excel file.
In order to do that, first you need to check whether the date column in your original DataFrame has already been converted to a datetime/date object:
import numpy as np

date_column = "date"  # Suppose this is your date column name
if not np.issubdtype(df[date_column].dtype, np.datetime64):
    df.loc[:, date_column] = pd.to_datetime(df[date_column], format="%Y-%m-%d")
Now you can use a regular indexing operation to get all values you need:
mask = (df[date_column] >= '1910-01-01') & (df[date_column] <= '2021-01-01') # Creates mask for date range
out_dataframe = df.loc[mask] # Here we select the indices using our mask
out_dataframe.to_excel(out_excel_file)
You can try to create a dataframe from the DatetimeIndex before writing it to Excel, as follows:
range_df = pd.DataFrame(index=range).rename_axis(index='range').reset_index()
or as suggested by #guimorg, we can also do it as:
range_df = range.to_frame(index=False, name='range')
Then, continue with your code to write it to Excel:
range_df.to_excel(out_excel_file, index=False)
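Putting both answers together with the original goal, a sketch of the full flow could look like this (the column name 'date' is an assumption, and the range variable is renamed to date_range so it does not shadow the Python built-in):
import pandas as pd

in_excel_file = r'path\to\infile.xlsx'
out_excel_file = r'path\to\outfile.xlsx'

df = pd.read_excel(in_excel_file)
date_range = pd.date_range(start='1910-1-1', end='2021-1-1')

# make sure the date column is a real datetime, then keep only rows inside the range
df['date'] = pd.to_datetime(df['date'])
selected = df[df['date'].between(date_range.min(), date_range.max())]
selected.to_excel(out_excel_file, index=False)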

removing columns in a loop from different size dataframes [duplicate]

I am reading from an Excel sheet and I want to read certain columns: column 0 because it is the row-index, and columns 22:37. Now here is what I do:
import pandas as pd
import numpy as np
file_loc = "path.xlsx"
df = pd.read_excel(file_loc, index_col=None, na_values=['NA'], parse_cols = 37)
df= pd.concat([df[df.columns[0]], df[df.columns[22:]]], axis=1)
But I would hope there is a better way to do that! I know I can do it with parse_cols=[0, 22,..,37], but for large datasets this doesn't make sense.
I also did this:
s = pd.Series(0)
s[1] = 22
for i in range(2, 14):
    s[i] = s[i-1] + 1
df = pd.read_excel(file_loc, index_col=None, na_values=['NA'], parse_cols=s)
But it reads the first 15 columns, which is the length of s.
You can use column indices (Excel letters) like this:
import pandas as pd
import numpy as np
file_loc = "path.xlsx"
df = pd.read_excel(file_loc, index_col=None, na_values=['NA'], usecols="A,C:AA")
print(df)
Corresponding documentation:
usecols : int, str, list-like, or callable default None
If None, then parse all columns.
If str, then indicates comma separated list of Excel column letters and column ranges (e.g. “A:E” or “A,C,E:F”). Ranges are inclusive of both sides.
If list of int, then indicates list of column numbers to be parsed.
If list of string, then indicates list of column names to be parsed.
New in version 0.24.0.
If callable, then evaluate each column name against it and parse the column if the callable returns True.
Returns a subset of the columns according to behavior above.
New in version 0.24.0.
parse_cols is deprecated, use usecols instead
that is:
df = pd.read_excel(file_loc, index_col=None, na_values=['NA'], usecols = "A,C:AA")
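Per the "list of int" case in the documentation quoted above, integer positions work too, which matches the column-0-plus-22:37 layout from the question (a sketch reusing file_loc):
import pandas as pd

file_loc = "path.xlsx"
# 0-based positions: column 0 plus columns 22 through 37 inclusive
df = pd.read_excel(file_loc, index_col=None, na_values=['NA'],
                   usecols=[0] + list(range(22, 38)))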
"usecols" should help, use range of columns (as per excel worksheet, A,B...etc.)
below are the examples
1. Selected Columns
df = pd.read_excel(file_location, sheet_name='Sheet1', usecols="A,C,F")
2. Range of Columns and selected column
df = pd.read_excel(file_location, sheet_name='Sheet1', usecols="A:F,H")
3. Multiple Ranges
df = pd.read_excel(file_location, sheet_name='Sheet1', usecols="A:F,H,J:N")
4. Range of columns
df = pd.read_excel(file_location, sheet_name='Sheet1', usecols="A:N")
If you know the names of the columns and do not want to use A,B,D or 0,4,7, this actually works:
df = pd.read_excel(url)[['name of column','name of column','name of column','name of column','name of column']]
where "name of column" = columns wanted. Case and whitespace sensitive
Read any column's data in Excel:
import pandas as pd
name_of_file = "test.xlsx"
data = pd.read_excel(name_of_file)
required_column_name = "Post test Number"
print(data[required_column_name])
Unfortunately these methods still seem to read and convert the headers before returning the subselection. I have an Excel sheet with duplicate header names because the sheet contains several similar tables. I want to read those tables individually, so I would want to apply usecols. However, this still adds suffixes to the duplicate column names.
To reproduce:
create an Excel sheet with headers named Header1, Header2, Header1, Header2 under columns A, B, C, D
df = pd.read_excel(filename, usecols='C:D')
df.columns will return ['Header1.1', 'Header2.1']
Is there a way to circumvent this, aside from splitting and joining the resulting headers? Especially when it is unknown whether there are duplicate columns, it is tricky to rename them, as splitting on '.' may corrupt a non-duplicate header.
Edit: additionally, the length (in indices) of a DataFrame based on a subset of columns will be determined by the length of the full file. So if column A has 10 rows and column B only has 5, a DataFrame generated by usecols='B' will have 10 rows, of which 5 are filled with NaNs.
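One possible workaround for the duplicate-header mangling, sketched under the assumption that the headers sit in the first row of a hypothetical tables.xlsx: read the header row and the data body separately with header=None, so pandas never gets a chance to de-duplicate the names.
import pandas as pd

filename = "tables.xlsx"   # hypothetical file with Header1, Header2, Header1, Header2 in A:D

# read the raw header row and the data body separately
header = pd.read_excel(filename, usecols="C:D", header=None, nrows=1).iloc[0].tolist()
data = pd.read_excel(filename, usecols="C:D", header=None, skiprows=1)
data.columns = header          # duplicate names are kept exactly as they appear in the sheet
print(data.columns.tolist())   # ['Header1', 'Header2']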

JSON string within CSV data read by pandas [duplicate]

I am working with CSV files where several of the columns have a simple json object (several key value pairs) while other columns are normal. Here is an example:
name,dob,stats
john smith,1/1/1980,"{""eye_color"": ""brown"", ""height"": 160, ""weight"": 76}"
dave jones,2/2/1981,"{""eye_color"": ""blue"", ""height"": 170, ""weight"": 85}"
bob roberts,3/3/1982,"{""eye_color"": ""green"", ""height"": 180, ""weight"": 94}"
After using df = pandas.read_csv('file.csv'), what's the most efficient way to parse and split the stats column into additional columns?
After about an hour, the only thing I could come up with was:
import json
stdf = df['stats'].apply(json.loads)
stlst = list(stdf)
stjson = json.dumps(stlst)
df.join(pandas.read_json(stjson))
It seems like I'm doing this wrong, and it's quite a bit of work considering I'll need to do this for three columns regularly.
Desired output is the dataframe object below. Added following lines of code to get there in my (crappy) way:
df = df.join(pandas.read_json(stjson))
del(df['stats'])
In [14]: df
Out[14]:
          name       dob eye_color  height  weight
0   john smith  1/1/1980     brown     160      76
1   dave jones  2/2/1981      blue     170      85
2  bob roberts  3/3/1982     green     180      94
I think applying json.loads is a good idea, but from there you can simply directly convert it to dataframe columns instead of writing/loading it again:
stdf = df['stats'].apply(json.loads)
pd.DataFrame(stdf.tolist()) # or stdf.apply(pd.Series)
or alternatively in one step:
df.join(df['stats'].apply(json.loads).apply(pd.Series))
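As a self-contained check of that one-step join, here is a runnable sketch using the three sample rows from the question (StringIO stands in for the real file):
import json
from io import StringIO
import pandas as pd

csv_data = '''name,dob,stats
john smith,1/1/1980,"{""eye_color"": ""brown"", ""height"": 160, ""weight"": 76}"
dave jones,2/2/1981,"{""eye_color"": ""blue"", ""height"": 170, ""weight"": 85}"
bob roberts,3/3/1982,"{""eye_color"": ""green"", ""height"": 180, ""weight"": 94}"'''

df = pd.read_csv(StringIO(csv_data))
df = df.join(df['stats'].apply(json.loads).apply(pd.Series)).drop(columns='stats')
print(df)   # name and dob plus new eye_color, height and weight columns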
There is a slightly easier way, but ultimately you'll still have to call json.loads. There is a notion of a converter in pandas.read_csv:
converters : dict, optional
Dict of functions for converting values in certain columns. Keys can either be integers or column labels
So first define your custom parser. In this case the below should work:
def CustomParser(data):
    import json
    j1 = json.loads(data)
    return j1
In your case you'll have something like:
df = pandas.read_csv(f1, converters={'stats': CustomParser}, header=0)
We are telling read_csv to read the data in the standard way, but to use our custom parser for the stats column. This will make each value in the stats column a dict.
From here, we can use a little hack to directly append these columns in one step with the appropriate column names. This will only work for regular data (the JSON object needs to have the same 3 values in every row, or at least missing values need to be handled in our CustomParser).
df[sorted(df['stats'][0].keys())] = df['stats'].apply(pandas.Series)
On the left-hand side, we get the new column names from the keys of an element of the stats column. Each element in the stats column is a dictionary, so we are doing a bulk assign. On the right-hand side, we break up the 'stats' column using apply to make a data frame out of each key/value pair.
Option 1
If you dumped the column with json.dumps before you wrote it to csv, you can read it back in with:
import json
import pandas as pd
df = pd.read_csv('data/file.csv', converters={'json_column_name': json.loads})
Option 2
If you didn't then you might need to use this:
import json
import pandas as pd
df = pd.read_csv('data/file.csv', converters={'json_column_name': eval})
Option 3
For more complicated situations you can write a custom converter like this:
import json
import pandas as pd
def parse_column(data):
    try:
        return json.loads(data)
    except Exception as e:
        print(e)
        return None

df = pd.read_csv('data/file.csv', converters={'json_column_name': parse_column})
Paul's original answer was very nice but not correct in general, because there is no assurance that the ordering of columns is the same on the left-hand side and the right-hand side of the last line. (In fact, it does not seem to work on the test data in the question, instead erroneously switching the height and weight columns.)
We can fix this by ensuring that the list of dict keys on the LHS is sorted. This works because the apply on the RHS automatically sorts by the index, which in this case is the list of column names.
def CustomParser(data):
    import json
    j1 = json.loads(data)
    return j1
df = pandas.read_csv(f1, converters={'stats':CustomParser},header=0)
df[sorted(df['stats'][0].keys())] = df['stats'].apply(pandas.Series)
The json_normalize function in the pandas.io.json package helps to do this without using a custom function.
(assuming you are loading the data from a file)
from pandas.io.json import json_normalize
import json

df = pd.read_csv(file_path, header=0)
stats_df = json_normalize(df['stats'].apply(json.loads).tolist())
stats_df.set_index(df.index, inplace=True)
df = df.join(stats_df)
df.drop(columns=['stats'], inplace=True)
If you have DateTime values in your .csv file, df[sorted(df['stats'][0].keys())] = df['stats'].apply(pandas.Series) will mess up the datetime values.
This link has some tips on how to read a csv file with JSON strings into a dataframe.
You could do the following to read a csv file with a JSON string column and convert the JSON string into columns.
Read your csv into a dataframe (read_df):
read_df = pd.read_csv('yourFile.csv', converters={'state': json.loads}, header=0, quotechar="'")
Convert the JSON string column to a new dataframe:
state_df = read_df['state'].apply(pd.Series)
Merge the two dataframes on the index:
df = pd.merge(read_df, state_df, left_index=True, right_index=True)
