When I use the following on a df...
df.columns.values.tostring()
I get the following, which looks nothing like my column names (and there are far fewer columns than that suggests). When I omit "tostring()", I just get the column names.
b'0\x16B\n\x00\x00\x00\x00p\x84P\n\x00\x00\x00\x00\xf0\xe7x\t\x00\x00\x00\x00\xb0\xf3J\n\x00\x00\x00\x00p\xfc\t\x0c\x00\x00\x00\x000\xad\xd7\x00\x00\x00\x00\x00p\xae\xd7\x00\x00\x00\x00\x00\xf0\xab\xd7\x00\x00\x00\x00\x00(9\x05\x01\x00\x00\x00\x00\xf0\xa7\xdd\x0b\x00\x00\x00\x00p\xac\xdd\x0b\x00\x00\x00\x00\xf0\xed\xc1\x00\x00\x00\x00\x00\xb0\xa3\xdd\x0b\x00\x00\x00\x000g\xdd\x0b\x00\x00\x00\x00p\xf2\xb2\x0c\x00\x00\x00\x000\xf1\xb2\x0c\x00\x00\x00\x00\xf0\xf0\xb2\x0c\x00\x00\x00\x00\xb0\xf0\xb2\x0c\x00\x00\x00\x00\xa0w\x9a\x05\x00\x00\x00\x000\xae\xd7\x00\x00\x00\x00\x00\x90\x9c\xe4\x00\x00\x00\x00\x00\xd0U\n\x0c\x00\x00\x00\x00\xb0\xfa\t\x0c\x00\x00\x00\x00\xb0\n\xca\x00\x00\x00\x00\x00\x88\x8e\xbb\x00\x00\x00\x00\x00\xf0\x05\xca\x00\x00\x00\x00\x00\x90<y\t\x00\x00\x00\x00\x18?y\t\x00\x00\x00\x00\xb0\x01\xca\x00\x00\x00\x00\x00\xb0=y\t\x00\x00\x00\x00\xf8=y\t\x00\x00\x00\x00p\xac\xd7\x00\x00\x00\x00\x00\xb0\xad\xd7\x00\x00\x00\x00\x00'
I can't figure out why. The df is a product of several instances of pd.merge and type conversions.
This isn't really a pandas thing, it's a numpy thing. df.columns.values gives us a numpy array:
>>> df = pd.DataFrame({"A": [1,2,3], "B": [4,5,6]})
>>> df
A B
0 1 4
1 2 5
2 3 6
>>> df.columns
Index(['A', 'B'], dtype='object')
>>> df.columns.values
array(['A', 'B'], dtype=object)
The tostring method of a numpy array promises:
Construct Python bytes containing the raw data bytes in the array.
Constructs Python bytes showing a copy of the raw contents of data memory. The bytes object can be produced in either ‘C’ or ‘Fortran’, or ‘Any’ order (the default is ‘C’-order). ‘Any’ order means C-order unless the F_CONTIGUOUS flag in the array is set, in which case it means ‘Fortran’ order.
This function is a compatibility alias for tobytes. Despite its name it returns bytes not strings.
which is why you get something messy:
>>> df.columns.values.tostring()
b'\xe0N\x0e\xb7\x00\\\x14\xb7'
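If what you actually want is the column names as Python strings, skip tostring() entirely; a minimal sketch using the df above:
>>> df.columns.tolist()
['A', 'B']
>>> list(df.columns)
['A', 'B']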
This is data from a .csv format file.
Generally we expect an array/list with comma-separated values like [1,2,3,4],
but it seems that nothing of the sort happened in this case:
data = pd.read_csv('file.csv')
data_array = data.values
print(data_array)
print(type(data_array[0]))
and here is the output data
[16025788 179 '179batch1640694482' 18055630 8317948789 '2021-12-28'
8315780000.0 '6214' 'CA' Nan Nan 'Wireless' '2021-12-28 12:32:46'
'2021-12-28 12:32:46']
<class 'numpy.ndarray'>
So, I am looking for a way to get an array with comma-separated values.
Okay, so simply make these changes:
import numpy

converted_str = numpy.array_str(data_array)
converted_str = converted_str.replace(' ', ',')  # str.replace returns a new string; reassign it
print(converted_str)
Now, if you want the output back as <class 'numpy.ndarray'>, simply convert it back to a numpy array. I hope this helps! 😉
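Alternatively, for a plain comma-separated string you could join the stringified elements of a row. A sketch, assuming the data_array from the question:
row_str = ','.join(str(v) for v in data_array[0])
print(row_str)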
Without the csv or dataframe (or at least a sample) there's some ambiguity as to what your data array is like. But let me illustrate things with a sample.
In [166]: df = pd.DataFrame([['one',2],['two',3]])
the dataframe display:
In [167]: df
Out[167]:
0 1
0 one 2
1 two 3
The array derived from the frame:
In [168]: data = df.values
In [169]: data
Out[169]:
array([['one', 2],
['two', 3]], dtype=object)
In my IPython session, the display is actually the repr representation of the array. Note the commas, the word 'array', and the dtype.
In [170]: print(repr(data))
array([['one', 2],
['two', 3]], dtype=object)
A print of the array omits those words and commas. That's the str format. Omitting the commas is normal for numpy arrays, and helps distinguish them from lists. But let me stress that this is just the display style.
In [171]: print(data)
[['one' 2]
['two' 3]]
In [172]: print(data[0])
['one' 2]
We can convert the array to a list:
In [173]: alist = data.tolist()
In [174]: alist
Out[174]: [['one', 2], ['two', 3]]
Commas are a standard part of list display.
But let me stress: commas or not, this is just part of the display. Don't confuse that with the underlying distinction between a pandas dataframe, a numpy array, and a Python list.
Convert to a normal Python list first:
print(list(data_array))
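Note that list() only converts the outer level, leaving numpy arrays (or numpy scalars) inside, whereas tolist() converts recursively. A quick contrast, using the 2-D sample data from the answer above:
print(list(data))     # [array(['one', 2], dtype=object), array(['two', 3], dtype=object)]
print(data.tolist())  # [['one', 2], ['two', 3]]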
I have the following DataFrame from a SQL query:
(Pdb) pp total_rows
ColumnID RespondentCount
0 -1 2
1 3030096843 1
2 3030096845 1
and I want to pivot it like this:
total_data = total_rows.pivot_table(cols=['ColumnID'])
(Pdb) pp total_data
ColumnID -1 3030096843 3030096845
RespondentCount 2 1 1
[1 rows x 3 columns]
total_rows.pivot_table(cols=['ColumnID']).to_dict('records')[0]
{3030096843: 1, 3030096845: 1, -1: 2}
but I want to make sure the 303 columns are cast as strings instead of integers so that I get this:
{'3030096843': 1, '3030096845': 1, -1: 2}
One way to convert to string is to use astype:
total_rows['ColumnID'] = total_rows['ColumnID'].astype(str)
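A minimal sketch of the effect, rebuilding the question's frame (note that pivot_table's cols= argument is from old pandas; modern versions spell it columns=):
import pandas as pd

total_rows = pd.DataFrame({'ColumnID': [-1, 3030096843, 3030096845],
                           'RespondentCount': [2, 1, 1]})
total_rows['ColumnID'] = total_rows['ColumnID'].astype(str)
print(total_rows.pivot_table(columns=['ColumnID']).to_dict('records')[0])
# every key is now a str, including '-1', which the question wanted left as int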
However, perhaps you are looking for the to_json function, which will convert keys to valid json (and therefore your keys to strings):
In [11]: df = pd.DataFrame([['A', 2], ['A', 4], ['B', 6]])
In [12]: df.to_json()
Out[12]: '{"0":{"0":"A","1":"A","2":"B"},"1":{"0":2,"1":4,"2":6}}'
In [13]: df[0].to_json()
Out[13]: '{"0":"A","1":"A","2":"B"}'
Note: you can pass in a buffer/file to save this to, along with some other options...
If you need to convert ALL columns to strings, you can simply use:
df = df.astype(str)
This is useful if you need everything except a few columns to be strings/objects: convert everything first, then go back and convert the remaining columns to whatever you need (integer in this case):
df[["D", "E"]] = df[["D", "E"]].astype(int)
pandas >= 1.0: It's time to stop using astype(str)!
Prior to pandas 1.0 (well, 0.25 actually) this was the de facto way of declaring a Series/column as a string:
# pandas <= 0.25
# Note to pedants: specifying the type is unnecessary since pandas will
# automagically infer the type as object
s = pd.Series(['a', 'b', 'c'], dtype=str)
s.dtype
# dtype('O')
From pandas 1.0 onwards, consider using "string" type instead.
# pandas >= 1.0
s = pd.Series(['a', 'b', 'c'], dtype="string")
s.dtype
# StringDtype
Here's why, as quoted from the docs:
You can accidentally store a mixture of strings and non-strings in an object dtype array. It’s better to have a dedicated dtype.
object dtype breaks dtype-specific operations like DataFrame.select_dtypes(). There isn’t a clear way to select just text while excluding non-text but still object-dtype columns.
When reading code, the contents of an object dtype array is less clear than 'string'.
See also the section on Behavioral Differences between "string" and object.
Extension types (introduced in 0.24 and formalized in 1.0) are closer to pandas than numpy, which is good because numpy types are not powerful enough. For example, NumPy does not have any way of representing missing data in integer data (since type(NaN) == float), but pandas can, using Nullable Integer columns.
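A minimal illustration of that point:
import numpy as np
import pandas as pd

print(pd.Series([1, 2, np.nan]).dtype)         # float64 -- the NaN forces an upcast
print(pd.Series([1, 2, None], dtype="Int64"))  # stays integer; the gap displays as <NA>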
Why should I stop using it?
Accidentally mixing dtypes
The first reason, as outlined in the docs is that you can accidentally store non-text data in object columns.
# pandas <= 0.25
pd.Series(['a', 'b', 1.23]) # whoops, this should have been "1.23"
0 a
1 b
2 1.23
dtype: object
pd.Series(['a', 'b', 1.23]).tolist()
# ['a', 'b', 1.23] # oops, pandas was storing this as float all the time.
# pandas >= 1.0
pd.Series(['a', 'b', 1.23], dtype="string")
0 a
1 b
2 1.23
dtype: string
pd.Series(['a', 'b', 1.23], dtype="string").tolist()
# ['a', 'b', '1.23'] # it's a string and we just averted some potentially nasty bugs.
Challenging to differentiate strings and other Python objects
Another obvious example is that it's harder to distinguish between "strings" and "objects". Objects are essentially the blanket type for any type that does not support vectorizable operations.
Consider,
# Setup
df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': [{}, [1, 2, 3], 123]})
df
A B
0 a {}
1 b [1, 2, 3]
2 c 123
Up to pandas 0.25, there was virtually no way to distinguish that "A" and "B" do not have the same type of data.
# pandas <= 0.25
df.dtypes
A object
B object
dtype: object
df.select_dtypes(object)
A B
0 a {}
1 b [1, 2, 3]
2 c 123
From pandas 1.0, this becomes a lot simpler:
# pandas >= 1.0
# Convenience function I call to help illustrate my point.
df = df.convert_dtypes()
df.dtypes
A string
B object
dtype: object
df.select_dtypes("string")
A
0 a
1 b
2 c
Readability
This is self-explanatory ;-)
OK, so should I stop using it right now?
...No. As of writing this answer (version 1.1), there are no performance benefits, but the docs expect future enhancements to significantly improve performance and reduce memory usage for "string" columns as opposed to objects. With that said, however, it's never too early to form good habits!
Here's another one, particularly useful for converting multiple columns to string instead of just a single column:
In [76]: import numpy as np
In [77]: import pandas as pd
In [78]: df = pd.DataFrame({
...: 'A': [20, 30.0, np.nan],
...: 'B': ["a45a", "a3", "b1"],
...: 'C': [10, 5, np.nan]})
...:
In [79]: df.dtypes ## Current datatype
Out[79]:
A float64
B object
C float64
dtype: object
## Multiple columns string conversion
In [80]: df[["A", "C"]] = df[["A", "C"]].astype(str)
In [81]: df.dtypes ## Updated datatype after string conversion
Out[81]:
A object
B object
C object
dtype: object
There are four ways to convert columns to string
1. astype(str)
df['column_name'] = df['column_name'].astype(str)
2. values.astype(str)
df['column_name'] = df['column_name'].values.astype(str)
3. map(str)
df['column_name'] = df['column_name'].map(str)
4. apply(str)
df['column_name'] = df['column_name'].apply(str)
Let's see the performance of each approach:
#importing libraries
import numpy as np
import pandas as pd
import time
#creating four sample dataframes using dummy data
df1 = pd.DataFrame(np.random.randint(1, 1000, size =(10000000, 1)), columns =['A'])
df2 = pd.DataFrame(np.random.randint(1, 1000, size =(10000000, 1)), columns =['A'])
df3 = pd.DataFrame(np.random.randint(1, 1000, size =(10000000, 1)), columns =['A'])
df4 = pd.DataFrame(np.random.randint(1, 1000, size =(10000000, 1)), columns =['A'])
#applying astype(str)
time1 = time.time()
df1['A'] = df1['A'].astype(str)
print('time taken for astype(str) : ' + str(time.time()-time1) + ' seconds')
#applying values.astype(str)
time2 = time.time()
df2['A'] = df2['A'].values.astype(str)
print('time taken for values.astype(str) : ' + str(time.time()-time2) + ' seconds')
#applying map(str)
time3 = time.time()
df3['A'] = df3['A'].map(str)
print('time taken for map(str) : ' + str(time.time()-time3) + ' seconds')
#applying apply(str)
time4 = time.time()
df4['A'] = df4['A'].apply(str)
print('time taken for apply(str) : ' + str(time.time()-time4) + ' seconds')
Output
time taken for astype(str): 5.472359895706177 seconds
time taken for values.astype(str): 6.5844292640686035 seconds
time taken for map(str): 2.3686647415161133 seconds
time taken for apply(str): 2.39758563041687 seconds
map(str) and apply(str) take less time compared with the other two techniques.
I usually use this one:
df['Column'].map(str)
pandas version: 1.3.5
Updated answer
df['colname'] = df['colname'].astype(str) => this should work by default. But if you create a str variable like str = "myString" before using astype(str), it won't work, because the name str no longer refers to the builtin type. In this case, you might want to use the line below instead.
df['colname'] = df['colname'].astype('str')
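A short sketch of the failure mode being described:
import pandas as pd

df = pd.DataFrame({'colname': [1, 2, 3]})

str = "myString"                             # shadows the builtin str type
# df['colname'].astype(str)                  # would now raise: data type 'myString' not understood
df['colname'] = df['colname'].astype('str')  # the string literal sidesteps the shadowing
del str                                      # restore access to the builtin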
===========
(Note: the old explanation below was incorrect; it is kept only for reference)
df['colname'] = df['colname'].astype('str') => converts dataframe column into a string type
df['colname'] = df['colname'].astype(str) => gives an error
Using .apply() with a lambda conversion function also works in this case:
total_rows['ColumnID'] = total_rows['ColumnID'].apply(lambda x: str(x))
For entire dataframes you can use .applymap().
(though in any case .astype() is probably faster)
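For a whole frame that is a one-liner; note that applymap() was renamed to DataFrame.map() in pandas 2.1:
total_rows = total_rows.applymap(str)  # element-wise str() over every cell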
I'm looking for a method to add a column of float values to a matrix of string values.
Mymatrix =
[["a","b"],
["c","d"]]
I need to have a matrix like this =
[["a","b",0.4],
["c","d",0.6]]
I would suggest using a pandas DataFrame instead:
import pandas as pd
df = pd.DataFrame([["a","b",0.4],
["c","d",0.6]])
print(df)
0 1 2
0 a b 0.4
1 c d 0.6
You can also specify column (Series) names:
df = pd.DataFrame([["a","b",0.4],
["c","d",0.6]], columns=['A', 'B', 'C'])
df
A B C
0 a b 0.4
1 c d 0.6
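Starting from the string-only matrix, appending the float column directly to the frame is likely the simplest route; a sketch:
df = pd.DataFrame([["a", "b"], ["c", "d"]], columns=['A', 'B'])
df['C'] = [0.4, 0.6]  # each column keeps its own dtype: object, object, float64
print(df)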
As noted, you can't mix data types in an ndarray, but you can in a structured or record array. They are similar in that you can mix datatypes as defined by the dtype= argument (which defines the datatypes and field names). Record arrays additionally allow access to the fields of structured arrays by attribute instead of only by index. You don't need for loops when you want to copy the entire contents between arrays. See my example below (using your data):
import numpy as np

Mymatrix = np.array([["a","b"], ["c","d"]])
Mycol = np.array([0.4, 0.6])
dt = np.dtype([('col0','U1'),('col1','U1'),('col2',float)])
new_recarr = np.empty((2,), dtype=dt)
new_recarr['col0'] = Mymatrix[:,0]  # whole-column assignment, no loops needed
new_recarr['col1'] = Mymatrix[:,1]
new_recarr['col2'] = Mycol[:]
print(new_recarr)
Resulting output looks like this:
[('a', 'b', 0.4) ('c', 'd', 0.6)]
From there, use formatted strings to print.
You can also copy from a recarray to an ndarray if you reverse assignment order in my example.
Note: I discovered there can be a significant performance penalty when using recarrays. See answer in this thread:
is ndarray faster than recarray access?
You need to understand why you are doing this. NumPy is efficient because data are aligned in memory, so mixing types is generally a source of bad performance. But in your case you can preserve alignment, since all your strings have the same length. Since the types are not homogeneous, you can use a structured array:
import numpy as np

raw = [["a","b",0.4],
       ["c","d",0.6]]
dt = np.dtype([('col0','U1'),('col1','U1'),('col2',float)])
aligned = np.empty(len(raw), dt)
for i in range(len(raw)):
    for j in range(len(dt)):
        aligned[i][j] = raw[i][j]
You can also use pandas, but you often lose some performance.
I have a Python 3.x pandas DataFrame where certain columns are strings expressed as bytes (as in Python 2.x):
import pandas as pd
df = pd.DataFrame(...)
df
COLUMN1 ....
0 b'abcde' ....
1 b'dog' ....
2 b'cat1' ....
3 b'bird1' ....
4 b'elephant1' ....
When I access by column with df.COLUMN1, I see Name: COLUMN1, dtype: object
However, if I access by element, it is a "bytes" object
df.COLUMN1.ix[0].dtype
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'bytes' object has no attribute 'dtype'
How do I convert these into "regular" strings? That is, how can I get rid of this b'' prefix?
You can use vectorised str.decode to decode byte strings into ordinary strings:
df['COLUMN1'].str.decode("utf-8")
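To keep the change, assign the result back:
df['COLUMN1'] = df['COLUMN1'].str.decode("utf-8")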
To do this for multiple columns you can select just the str columns:
str_df = df.select_dtypes([object])  # np.object was removed in NumPy 1.24; the builtin object works
convert all of them:
str_df = str_df.stack().str.decode('utf-8').unstack()
You can then swap out converted cols with the original df cols:
for col in str_df:
    df[col] = str_df[col]
Combining the answers by #EdChum and #Yu Zhou, a simpler solution would be:
for col, dtype in df.dtypes.items():
    if dtype == object:  # Only process byte object columns (np.object was removed in NumPy 1.24).
        df[col] = df[col].apply(lambda x: x.decode("utf-8"))
I had an issue with some columns being either full of str or a mix of str and bytes in a dataframe. I solved it with a minor modification of the solution provided by #Christabella Irwanto (I'm more of a fan of the str.decode('utf-8'), as suggested by #Mad Physicist):
for col, dtype in df.dtypes.items():
    if dtype == object:  # Only process object columns.
        # decode, or fall back to the original value where decode returns NaN
        df[col] = df[col].str.decode('utf-8').fillna(df[col])
>>> df[col]
0 Element
1 b'Element'
2 b'165'
3 165
4 25
5 25
>>> df[col].str.decode('utf-8').fillna(df[col])
0 Element
1 Element
2 165
3 165
4 25
5 25
6 25
(replaced np.object with object to work with recent numpy version)
I came across this thread while trying to solve the same problem, but more generally for a Series where some values may be of type str, others of type bytes. Drawing from earlier solutions, I achieved this selective decoding as follows, resulting in a Series all of whose values are of type str. (python 3.6.9, pandas 1.0.5)
>>> import pandas as pd
>>> ser = pd.Series(["value_1".encode("utf-8"), "value_2"])
>>> ser.values
array([b'value_1', 'value_2'], dtype=object)
>>> ser2 = ser.str.decode("utf-8")
>>> ser[~ser2.isna()] = ser2
>>> ser.values
array(['value_1', 'value_2'], dtype=object)
Maybe there exists a more convenient/efficient one-liner for this use case? At first I figured there would be some value to pass in the "errors" kwarg to str.decode but I didn't find one documented.
EDIT: One can definitely achieve the same in one line, but the ways I have thought of to do so take about 25% longer (tested for Series of length 10^4 and 10^6), though presumably with no copying. E.g.:
ser[ser.apply(type) == bytes] = ser.str.decode("utf-8")
df['COLUMN1'].apply(lambda x: x.decode("utf-8"))
I'm wondering why HDFStore gives warnings on string columns in pandas. I thought it may be NaNs in my real database, but trying it here gives me the warning for both columns even though one is not mixed and is simply strings.
Using pandas 0.13.1 and tables 3.1.1
In [75]: d1 = {1:{'Mix': 'Hello', 'Good': 'Hello'}}
In [76]: d2 = {2:{'Good':'Goodbye'}}
In [77]: d2_df = pd.DataFrame.from_dict(d2,orient='index')
In [78]: d_df = pd.DataFrame.from_dict(d1,orient='index')
In [80]: d = pd.concat([d_df,d2_df])
In [81]: d
Out[81]:
Good Mix
1 Hello Hello
2 Goodbye NaN
[2 rows x 2 columns]
In [84]: d.to_hdf('test_.h5','d')
/home/cschwalbach/venv/lib/python2.7/site-packages/pandas-0.13.1-py2.7-linux-x86_64.egg/pandas/io/pytables.py:2446: PerformanceWarning:
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block0_values] [items->['Good', 'Mix']]
warnings.warn(ws, PerformanceWarning)
When storing using the fixed format (the default if you don't specify format), you are storing object dtypes (strings are stored as object dtype in pandas). These are variable-length formats which are not supported by PyTables in the Array types (CArray, EArray); see the warning here.
You can, however, store in format='table'; see here for the docs on storing fixed-length strings.
The NaN value is the issue here. If you manage to replace with an empty string, the warning will go away.
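A sketch of both fixes, using the frame d from above (keyword spellings per current pandas):
# Option 1: table format supports variable-length strings without pickling
d.to_hdf('test_.h5', key='d', format='table')

# Option 2: keep the fixed format, but remove the NaN that makes the column mixed-type
d.fillna('').to_hdf('test_.h5', key='d')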