Set max string length in pandas - python

I want my dataframe to auto-truncate strings which are longer than a certain length.
basically:
pd.set_option('auto_truncate_string_exceeding_this_length', 255)
Any ideas? I have hundreds of columns and don't want to iterate over every data point. If this can be achieved during import that would also be fine (e.g. pd.read_csv())
Thanks.

pd.set_option('display.max_colwidth', 255)
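Note that display.max_colwidth only controls how long strings are rendered when the frame is printed; the underlying data is not truncated. A minimal sketch of that behaviour (the column name a is just for illustration):
import pandas as pd

pd.set_option('display.max_colwidth', 10)
df = pd.DataFrame({'a': ['this string is far longer than ten characters']})
print(df)                 # rendered value is shortened with an ellipsis
print(len(df['a'][0]))    # the stored string still has its full length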

You can use read_csv converters. Let's say you want to truncate column abc; you can pass a dictionary with a function, like:
def auto_truncate(val):
    return val[:255]

df = pd.read_csv('file.csv', converters={'abc': auto_truncate})
If you have columns with different lengths:
df = pd.read_csv('file.csv', converters={'abc': lambda x: x[:255], 'xyz': lambda x: x[:512]})
Make sure the column type is string. A column index can also be used instead of the name in the converters dict.
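As a sketch of that positional variant (the file name and the column indices 0 and 2 are just placeholders, and those columns are assumed to contain strings):
import pandas as pd

truncate_255 = lambda val: val[:255]
# Truncate the first and third columns by position instead of by name.
df = pd.read_csv('file.csv', converters={0: truncate_255, 2: truncate_255})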

I'm not sure you can do this on the whole df at read time, but the following would work after loading:
In [21]:
df = pd.DataFrame({"a": ['jasjdhadasd'] * 5, "b": np.arange(5)})
df
Out[21]:
             a  b
0  jasjdhadasd  0
1  jasjdhadasd  1
2  jasjdhadasd  2
3  jasjdhadasd  3
4  jasjdhadasd  4
In [22]:
for col in df:
    if df[col].dtype == object:  # only slice string (object) columns
        df[col] = df[col].str.slice(0, 5)
df
Out[22]:
       a  b
0  jasjd  0
1  jasjd  1
2  jasjd  2
3  jasjd  3
4  jasjd  4
EDIT
I think if you specified the dtypes in the args to read_csv then you could set the max length:
df = pd.read_csv('file.csv', dtype=(np.str, maxlen))
I will try this and confirm shortly
UPDATE
Sadly you cannot specify the length; an error is raised if you try this:
NotImplementedError: the dtype <U5 is not supported for parsing
when attempting to pass the arg dtype=(str, 5)

You can also simply truncate a single column with
df['A'] = df['A'].str[:255]
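If you want to apply this to every string column after loading, which is closer to what the question asks for with hundreds of columns, a minimal sketch assuming the strings live in object-dtype columns (the file name is a placeholder):
import pandas as pd

df = pd.read_csv('file.csv')
# Truncate every object (string) column to 255 characters;
# non-string values in those columns would come back as NaN from .str.
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].str[:255]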

Related

pandas read_csv parse header as string type but i want integer

For example, the csv file is as below, and (1,2,3) is the header:
1,2,3
0,0,0
I read the csv file using pd.read_csv and print:
import pandas as pd
df = pd.read_csv('./test.csv')
print(df[1])
It raises KeyError: 1.
It seems that read_csv parses the header as strings.
Is there any way to use integer types for the dataframe columns?
I think a more general approach is to cast the column names to integer with astype:
df = pd.read_csv('./test.csv')
df.columns = df.columns.astype(int)
Another way is to first read only the header row and pass it via the names parameter of read_csv:
import csv
import numpy as np

with open("file.csv", "r") as f:
    reader = csv.reader(f)
    i = np.array(next(reader)).astype(int)

# another way:
# i = pd.read_csv("file.csv", nrows=0).columns.astype(int)

print(i)
[1 2 3]

df = pd.read_csv("file.csv", names=i, skiprows=1)
print(df.columns)
Int64Index([1, 2, 3], dtype='int64')
Skip the header row using skiprows=1 and header=None. This automatically loads a dataframe with integer headers starting from 0 onwards.
df = pd.read_csv('test.csv', skiprows=1, header=None).rename(columns=lambda x: x + 1)
df
   1  2  3
0  0  0  0
The rename call is optional, but if you want your headers to start from 1, you may keep it in.
If you have a MultiIndex, use set_levels to set just the 0th level to integer:
df.columns = df.columns.set_levels(
    df.columns.levels[0].astype(int), level=0
)
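A small self-contained sketch of the MultiIndex case, assuming a two-row header where the first level is numeric (the inline csv data is only an example):
import io
import pandas as pd

csv_data = "1,1,2,2\na,b,a,b\n1,3,5,7\n0,2,4,6\n"
df = pd.read_csv(io.StringIO(csv_data), header=[0, 1])

# Convert only the first level of the column MultiIndex to integers.
df.columns = df.columns.set_levels(df.columns.levels[0].astype(int), level=0)
print(df.columns.get_level_values(0))  # first level is now integer: [1, 1, 2, 2]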
You can use set_axis in conjunction with a lambda and pd.Index.map
Consider a csv that looks like:
1,1,2,2
a,b,a,b
1,3,5,7
0,2,4,6
Read it like:
df = pd.read_csv('test.csv', header=[0, 1])
df
   1     2
   a  b  a  b
0  1  3  5  7
1  0  2  4  6
You can then convert the first level to integers in a single chained call:
df.set_axis(df.columns.map(lambda i: (int(i[0]), i[1])), axis=1, inplace=False)
   1     2
   a  b  a  b
0  1  3  5  7
1  0  2  4  6
is there any way using integer type in dataframe column?
I find this quite elegant:
df = pd.read_csv('test.csv').rename(columns=int)
Note that int here is the built-in function int().

Forcing pandas .iloc to return a single-row dataframe?

For programming purpose, I want .iloc to consistently return a data frame, even when the resulting data frame has only one row. How to accomplish this?
Currently, .iloc returns a Series when the result only has one row. Example:
In [1]: df = pd.DataFrame({'a':[1,2], 'b':[3,4]})
In [2]: df
Out[2]:
a b
0 1 3
1 2 4
In [3]: type(df.iloc[0, :])
Out[3]: pandas.core.series.Series
This behavior is poor for 2 reasons:
- Depending on the number of chosen rows, .iloc can either return a Series or a DataFrame, forcing me to manually check for this in my code
- .loc, on the other hand, always returns a DataFrame, making pandas inconsistent within itself (wrong info, as pointed out in the comments)
For the R user, this can be accomplished with drop = FALSE, or by using tidyverse's tibble, which always returns a data frame by default.
Use double brackets,
df.iloc[[0]]
Output:
a b
0 1 3
print(type(df.iloc[[0]]))
<class 'pandas.core.frame.DataFrame'>
Short for df.iloc[[0],:]
Accessing row(s) by label: loc
# Setup
df = pd.DataFrame({'X': [1, 2, 3], 'Y':[4, 5, 6]}, index=['a', 'b', 'c'])
df
X Y
a 1 4
b 2 5
c 3 6
To get a DataFrame instead of a Series, pass a list of indices of length 1,
df.loc[['a']]
# Same as
df.loc[['a'], :] # selects all columns
X Y
a 1 4
To select multiple specific rows, use
df.loc[['a', 'c']]
X Y
a 1 4
c 3 6
To select a contiguous range of rows, use
df.loc['b':'c']
X Y
b 2 5
c 3 6
Accessing row(s) by position: iloc
Specify a list of indices of length 1,
i = 1
df.iloc[[i]]
X Y
b 2 5
Or, specify a slice of length 1:
df.iloc[i:i+1]
X Y
b 2 5
To select multiple rows or a contiguous slice you'd use a similar syntax as with loc.
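For completeness, a short sketch of both forms using the same df as above (index a, b, c):
# List of positions -> DataFrame with exactly those rows
df.iloc[[0, 2]]
#    X  Y
# a  1  4
# c  3  6

# Contiguous slice of positions -> DataFrame (end position excluded)
df.iloc[1:3]
#    X  Y
# b  2  5
# c  3  6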
The double-bracket approach doesn't always work for me (e.g. when I use a conditional to select a timestamped row with loc).
You can, however, just add to_frame() to your operation.
>>> df = pd.DataFrame({'a':[1,2], 'b':[3,4]})
>>> df2 = df.iloc[0, :].to_frame().transpose()
>>> type(df2)
<class 'pandas.core.frame.DataFrame'>
Please use one of the options below:
df1 = df.iloc[[0],:]
#type(df1)
df1
or
df1 = df.iloc[0:1,:]
#type(df1)
df1
To extract a single row from a DataFrame as a DataFrame, use:
df_name.iloc[index, :].to_frame().transpose()
single_Sample1 = df.iloc[7:10]
single_Sample1

Adding new columns to DataFrame Python. SettingWithCopyWarning

I'm trying to add a new column to a data frame. I have a column of dates; I turn it into seconds since epoch and add that to a new column of the data frame:
def addEpochTime(df):
    df[7] = np.NaN                       # Adding empty column.
    for n in range(0, len(df)):          # Writing to empty column.
        df[7][n] = df[0][n] - 5          # Conduct some mathematical mutations...

addEpochTime(df)
What I've written above works, but I do get a warning, i.e.: SettingWithCopyWarning.
My question is, how can I add a new column to a data frame and write data to it?
I don't fully understand the way data frames are indexed, despite having read about it in the pandas documentation.
Since you say -
I have column of dates, I turn it into seconds-since-epoch and add that to a new column of the data frame
If what you are actually doing is as simple as df[7][n] = df[0][n] - 5, then you can use the Series.apply method to do the same thing. In your case:
def addEpochTime(df):
    df[7] = df[0].apply(lambda x: x - 5)
The .apply method accepts a function as its parameter; the function is passed each value in the Series and should return the value after applying the logic.
You can also pass a function that accepts the date as a parameter and returns the seconds since epoch to .apply(), which might be what you are looking for.
Example -
In [4]: df = pd.DataFrame([[1,2],[3,4]],columns=['A','B'])
In [5]: df
Out[5]:
A B
0 1 2
1 3 4
In [6]: df['C'] = df['A'].apply(lambda x: x-5)
In [7]: df
Out[7]:
A B C
0 1 2 -4
1 3 4 -2
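A minimal sketch of the epoch-conversion variant mentioned above, assuming the column holds datetime-like strings (the column names and the helper to_epoch_seconds are just placeholders):
import pandas as pd

df = pd.DataFrame({'date': ['2015-01-01 00:00:00', '2015-01-02 12:00:00']})

def to_epoch_seconds(value):
    # Parse the string and return seconds since the Unix epoch.
    return pd.Timestamp(value).timestamp()

df['epoch'] = df['date'].apply(to_epoch_seconds)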
You can do it in a single line and avoid the warning:
df
>>   a
0    1
1    2
df['b'] = df['a'] - 5
df
>>   a  b
0    1 -4
1    2 -3

add columns different length pandas

I have a problem with adding columns in pandas.
I have a DataFrame with dimensions n x k, and in the process I will need to add columns with dimensions m x 1, where m is in [1, n], but I don't know m in advance.
When I try to do it:
df['Name column'] = data
# type(data) = list
the result is:
AssertionError: Length of values does not match length of index
Can I add columns with a different length?
If you use the accepted answer, you'll lose your column names, as shown in the accepted answer's example, and as described in the documentation (emphasis added):
The resulting axis will be labeled 0, ..., n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information.
It looks like column names ('Name column') are meaningful to the Original Poster / Original Question.
To keep the column names, use pandas.concat but don't pass ignore_index (its default value is False, so you can omit that argument altogether). Continue to use axis=1:
import pandas
# Note these columns have 3 rows of values:
original = pandas.DataFrame({
    'Age': [10, 12, 13],
    'Gender': ['M', 'F', 'F']
})
# Note this column has 4 rows of values:
additional = pandas.DataFrame({
    'Name': ['Nate A', 'Jessie A', 'Daniel H', 'John D']
})
new = pandas.concat([original, additional], axis=1)
# Identical:
# new = pandas.concat([original, additional], ignore_index=False, axis=1)
print(new.head())
#    Age Gender      Name
# 0   10      M    Nate A
# 1   12      F  Jessie A
# 2   13      F  Daniel H
# 3  NaN    NaN    John D
Notice how John D does not have an Age or a Gender.
Use concat and pass axis=1 and ignore_index=True:
In [38]:
import numpy as np
df = pd.DataFrame({'a':np.arange(5)})
df1 = pd.DataFrame({'b':np.arange(4)})
print(df1)
df
b
0 0
1 1
2 2
3 3
Out[38]:
a
0 0
1 1
2 2
3 3
4 4
In [39]:
pd.concat([df,df1], ignore_index=True, axis=1)
Out[39]:
0 1
0 0 0
1 1 1
2 2 2
3 3 3
4 4 NaN
We can add lists of different sizes to a DataFrame.
Example
a = [0, 1, 2, 3]
b = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
c = [0, 1]
Find the length of all lists:
la, lb, lc = len(a), len(b), len(c)
# now find the max
max_len = max(la, lb, lc)
Resize all lists according to the determined max length (padding with empty strings in this example):
if not max_len == la:
    a.extend([''] * (max_len - la))
if not max_len == lb:
    b.extend([''] * (max_len - lb))
if not max_len == lc:
    c.extend([''] * (max_len - lc))
Now all the lists have the same length; create the dataframe:
pd.DataFrame({'A': a, 'B': b, 'C': c})
The final output is:
   A  B  C
0  0  0  0
1  1  1  1
2  2  2
3  3  3
4     4
5     5
6     6
7     7
8     8
9     9
I had the same issue: two different dataframes without a common column. I just needed to put them beside each other in a csv file.
Merge:
In this case, merge does not work, even after adding a temporary column to both dfs and then dropping it, because this method makes both dfs the same length. It repeats the rows of the shorter dataframe to match the longer dataframe's length.
Concat:
The idea of The Red Pea didn't work for me. It just appended the shorter df to the longer one (row-wise) while leaving an empty column (NaNs) above the shorter df's column.
Solution: You need to do the following:
df1 = df1.reset_index()
df2 = df2.reset_index()
df = [df1, df2]
df_final = pd.concat(df, axis=1)
df_final.to_csv(filename, index=False)
This way, you'll see your dfs besides each other (column-wise), each of which with its own length.
If somebody would like to replace a specific column of a different size instead of adding one:
Based on this answer, I use a dict as an intermediate type.
Create Pandas Dataframe with different sized columns
If the column to be inserted is not a list but already a dict, the respective line can be omitted.
def fill_column(dataframe: pd.DataFrame, values: list, column: str):
    dict_from_list = dict(enumerate(values))    # create a dict from the list, keyed by position
    dataframe_as_dict = dataframe.to_dict()     # get the DataFrame as a dict of dicts
    dataframe_as_dict[column] = dict_from_list  # assign the specific column
    # build a new DataFrame from the dict and return it
    return pd.DataFrame.from_dict(dataframe_as_dict, orient='index').T
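A small usage sketch of that helper (the data below is just an assumed example):
import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30], 'B': [1, 2, 3]})
# Replace column 'B' with a longer list; the other columns are padded with NaN.
df2 = fill_column(df, [7, 8, 9, 10], 'B')
print(df2)
#      A   B
# 0   10   7
# 1   20   8
# 2   30   9
# 3  NaN  10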

Is there a way to do a Series.map in place, but keep original value if no match?

The scenario here is that I've got a dataframe df with raw integer data, and a dict map_array which maps those ints to string values.
I need to replace the values in the dataframe with the corresponding values from the map, but keep the original value if it doesn't map to anything.
So far, the only way I've been able to figure out how to do what I want is by using a temporary column. However, with the size of data that I'm working with, this could sometimes get a little bit hairy. And so, I was wondering if there was some trick to do this in pandas without needing the temp column...
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(1, 5, size=(100, 1)))
map_array = {1: 'one', 2: 'two', 4: 'four'}

df['__temp__'] = df[0].map(map_array, na_action=None)
# I've tried varying the na_action arg to no effect

nan_index = df['__temp__'][df['__temp__'].isnull()].index
df.loc[nan_index, '__temp__'] = df[0].loc[nan_index]
df[0] = df['__temp__']
df = df.drop(['__temp__'], axis=1)
I think you can simply use .replace, whether on a DataFrame or a Series:
>>> df = pd.DataFrame(np.random.randint(1,5, size=(3,3)))
>>> df
0 1 2
0 3 4 3
1 2 1 2
2 4 2 3
>>> map_array = {1:'one', 2:'two', 4:'four'}
>>> df.replace(map_array)
0 1 2
0 3 four 3
1 two one two
2 four two 3
>>> df.replace(map_array, inplace=True)
>>> df
0 1 2
0 3 four 3
1 two one two
2 four two 3
I'm not sure what the memory hit of changing column dtypes will be, though.
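To get a rough sense of that, a minimal sketch (the frame below is only an assumed example, reusing map_array from above): columns that end up holding strings switch to object dtype, and memory_usage(deep=True) shows the resulting footprint.
>>> df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 4, 3]})
>>> df.dtypes
x    int64
y    int64
dtype: object
>>> df.replace(map_array).dtypes   # string values force object dtype
x    object
y    object
dtype: object
>>> df.replace(map_array).memory_usage(deep=True)  # per-column bytes, counting the strings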
