I'm trying to get the binary data field in a database as a hexadecimal string. I don't know if that is exactly what I'm looking for, but that is my hunch. There is a column called status in my table that shows [binary data] in PostgreSQL, but when I execute the following command the column does not come through as readable values:
df = pd.read_sql_query("""SELECT * FROM public."Vehtek_SLG" LIMIT 1000""",con=engine.connect())
How do I get the actual data in that column?
It looks like each row of your DataFrame holds a list of individual single-byte objects rather than one contiguous bytes string. The Series df["status"].map(b"".join) concatenates each list back into a whole bytes string.
import random
import pandas as pd
# Simulating lists of 10 bytes for each row
df = pd.DataFrame({
    "status": [
        [bytes([random.randint(0, 255)]) for _ in range(10)]
        for _ in range(5)
    ]
})
s = df["status"].map(b"".join)
Both objects look like:
# df
status
0 [b'\xb3', b'f', b';', b'P', b'\xcb', b'\x9b', ...
1 [b'\xd2', b'\xe8', b'.', b'b', b'g', b'|', b'\...
2 [b'\xa7', b'\xe1', b'z', b'-', b'W', b'\xb8', ...
3 [b'\xc5', b'\xa9', b'\xd5', b'\xde', b'\x1d', ...
4 [b'\xa3', b'b', b')', b'\xe3', b'5', b'`', b'\...
# s
0 b'\xb3f;P\xcb\x9bi\xb0\x9e\xfd'
1 b'\xd2\xe8.bg|\x94O\x90\n'
2 b'\xa7\xe1z-W\xb8\xc2\x84\xb91'
3 b'\xc5\xa9\xd5\xde\x1d\x02*}I\x15'
4 b'\xa3b)\xe35`\x0ed#g'
Name: status, dtype: object
After converting the status field to a single bytes object per row, we can then use the following to make it hexadecimal.
df['status'] = s.apply(bytes.hex)
And now here is your field!
df['status'].head()
0 1f8b0800000000000400c554cd8ed33010beafb4ef6045...
1 1f8b0800000000000400c554cd8ed33010beafb4ef6045...
2 1f8b0800000000000400c554cd6e9b4010be47ca3bac50...
3 1f8b0800000000000400c554cd6e9b4010be47ca3bac50...
4 1f8b0800000000000400c554cd6e9b4010be47ca3bac50...
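As a minimal self-contained sketch of the two steps above (joining the per-row byte list, then hex-encoding), with made-up sample bytes:

```python
# Hedged sketch: join a list of single-byte objects, then hex-encode.
# The sample bytes below are made up for illustration.
row = [b'\x1f', b'\x8b', b'\x08']
joined = b"".join(row)   # b'\x1f\x8b\x08'
print(joined.hex())      # 1f8b08
```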
Related
I want to remove commas from a column named size.
CSV looks like below:
number name size
1 Car 9,32,123
2 Bike 1,00,000
3 Truck 10,32,111
I want the output as below:
number name size
1 Car 932123
2 Bike 100000
3 Truck 1032111
I am using python3 and Pandas module for handling this csv.
I am trying the replace method but I don't get the desired output.
Snapshot from my code :
import pandas as pd
df = pd.read_csv("file.csv")
# df.replace(",", "")
# df['size'] = df['size'].replace(to_replace=",", value="")
# df['size'] = df['size'].replace(",", "")
df['size'] = df['size'].replace({",", ""})
print(df['size'])  # expecting to see 'size' column without commas
I don't see any error/exception. The last line print(df['size']) simply displays values as it is, ie, with commas.
With replace, we need regex=True because otherwise it looks for an exact full-cell match, i.e., it would only replace cells whose entire value is ,:
>>> df["size"] = df["size"].replace(",", "", regex=True)
>>> df
number name size
0 1 Car 932123
1 2 Bike 100000
2 3 Truck 1032111
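The difference is easy to see on a tiny made-up Series:

```python
import pandas as pd

# Made-up sample to show exact-match vs. regex replacement
s = pd.Series(["9,32,123", ","])
# Without regex=True, only cells whose entire value is "," are replaced
print(s.replace(",", "x").tolist())             # ['9,32,123', 'x']
# With regex=True, the pattern is applied inside every cell
print(s.replace(",", "", regex=True).tolist())  # ['932123', '']
```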
I am using python3 and Pandas module for handling this csv
Note that the pandas.read_csv function has an optional thousands argument; if , is used for denoting thousands you might set thousands=",". Consider the following example:
import io
import pandas as pd
some_csv = io.StringIO('value\n"1"\n"1,000"\n"1,000,000"\n')
df = pd.read_csv(some_csv, thousands=",")
print(df)
output
value
0 1
1 1000
2 1000000
For brevity I used io.StringIO; the same effect can be achieved by passing the name of a file with the same content as the first argument to pd.read_csv.
Try with str.replace instead:
df['size'] = df['size'].str.replace(',', '')
Optional convert to int with astype:
df['size'] = df['size'].str.replace(',', '').astype(int)
number name size
0 1 Car 932123
1 2 Bike 100000
2 3 Truck 1032111
Sample Frame Used:
df = pd.DataFrame({'number': [1, 2, 3], 'name': ['Car', 'Bike', 'Truck'],
'size': ['9,32,123', '1,00,000', '10,32,111']})
number name size
0 1 Car 9,32,123
1 2 Bike 1,00,000
2 3 Truck 10,32,111
I have a .csv with a 'cities' column. The column's values are supposed to be a list with each element being a list itself in the following format:
['City', (latitude, longitude)]
So for example:
[['Athens', (37.9839412, 23.7283052)], ['Heraklion', (35.3400127, 25.1343475)], ['Mykonos', (37.45142265, 25.392303200095327)]]
I am trying to load the csv into a pandas dataframe using pd.read_csv().
The value in the column ends up with type string and looks like this:
'[[\'Athens\', (37.9839412, 23.7283052)], [\'Heraklion\', (35.3400127, 25.1343475)], [\'Mykonos\', (37.45142265, 25.392303200095327)]]'
However, because it's a string, iterating over it yields one character at a time.
When I do:
for i in cities:
print(i)
Or:
list(cities)
I get:
[
[
'
A
t
h
e
n
s
'
,
(
3
7
.
9
8
3
9
4
1
2
,
2
3
.
7
2
8
3
0
5
2
)
]
,
etc.
I am looking for a way to 're-build' the data back into Python list format so that I can access the string 'Athens' with df.loc[0]['cities'][0] and the tuple (37.9839412, 23.7283052) with df.loc[0]['cities'][1].
I have tried df['cities'].astype(list) which results in the error:
TypeError: dtype '<class 'list'>' not understood
It seems the data is a string that looks like a Python list. You can recover the values using ast.literal_eval, applying it to every row of the DataFrame and, if desired, storing the city and coordinates as separate columns.
What's happening is that it's currently a string, rather than a list. Simply implement this:
import ast
l = '[[\'Athens\', (37.9839412, 23.7283052)], [\'Heraklion\', (35.3400127, 25.1343475)], [\'Mykonos\', (37.45142265, 25.392303200095327)]]'
res = ast.literal_eval(l)
print(res)
print(type(res))
Output:
[['Athens', (37.9839412, 23.7283052)], ['Heraklion', (35.3400127, 25.1343475)], ['Mykonos', (37.45142265, 25.392303200095327)]]
<class 'list'>
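If you control the read_csv call, a hedged alternative is to parse the column while reading, via the converters parameter (the column name 'cities' and the one-row CSV below are assumptions for illustration):

```python
import ast
import io
import pandas as pd

# Hypothetical one-row CSV mirroring the question's format
csv_text = 'cities\n"[[\'Athens\', (37.9839412, 23.7283052)]]"\n'
df = pd.read_csv(io.StringIO(csv_text),
                 converters={"cities": ast.literal_eval})
print(df.loc[0, "cities"][0][0])  # Athens
print(df.loc[0, "cities"][0][1])  # (37.9839412, 23.7283052)
```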
I have a Python (2.7) pandas DataFrame with a column that looks something like this:
email
['jsaw#yahoo.com']
['jfsjhj#yahoo.com']
['jwrk#yahoo.com']
['rankw#yahoo.com']
I want to extract email from it without the square bracket and single quotes. Output should like this :
email
jsaw#yahoo.com
jfsjhj#yahoo.com
jwrk#yahoo.com
rankw#yahoo.com
I have tried the suggestions from this answer: Replace all occurrences of a string in a pandas dataframe (Python). But it's not working. Any help will be appreciated.
edit:
What if I have array of more than 1 dimension. something like :
email
['jsaw#yahoo.com']
['jfsjhj#yahoo.com']
['jwrk#yahoo.com']
['rankw#yahoo.com','fsffsnl#gmail.com']
['mklcu#yahoo.com','riserk#gmail.com', 'funkdl#yahoo.com']
Is it possible to get the output in three different columns, without square brackets and single quotes?
You can use str.strip if the type of the values is string:
print type(df.at[0,'email'])
<type 'str'>
df['email'] = df.email.str.strip("[]'")
print df
email
0 jsaw#yahoo.com
1 jfsjhj#yahoo.com
2 jwrk#yahoo.com
3 rankw#yahoo.com
If the type is list, apply pd.Series:
print type(df.at[0,'email'])
<type 'list'>
df['email'] = df.email.apply(pd.Series)
print df
email
0 jsaw#yahoo.com
1 jfsjhj#yahoo.com
2 jwrk#yahoo.com
3 rankw#yahoo.com
EDIT: If you have multiple values in array, you can use:
df1 = df['email'].apply(pd.Series).fillna('')
print df1
0 1 2
0 jsaw#yahoo.com
1 jfsjhj#yahoo.com
2 jwrk#yahoo.com
3 rankw#yahoo.com fsffsnl#gmail.com
4 mklcu#yahoo.com riserk#gmail.com funkdl#yahoo.com
Try this one:
from re import findall
s = "['rankw#yahoo.com']"
m = findall(r"\[([A-Za-z0-9#'._]+)\]", s)
print(m[0].replace("'",''))
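If the values are strings rather than lists (and may contain several addresses), a hedged combination of both ideas is to parse with ast.literal_eval first and then expand into columns; the sample frame below is made up:

```python
import ast
import pandas as pd

# Made-up sample mirroring the edited question
df = pd.DataFrame({"email": ["['jsaw#yahoo.com']",
                             "['rankw#yahoo.com','fsffsnl#gmail.com']"]})
parsed = df["email"].apply(ast.literal_eval)     # strings -> lists
wide = pd.DataFrame(parsed.tolist()).fillna("")  # one column per address
print(wide.iloc[1, 1])  # fsffsnl#gmail.com
```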
I have data streaming in the following format:
from StringIO import StringIO
data ="""\
ANI/IP
sip:5554447777#10.94.2.15
sip:10.66.7.34#6665554444
sip:3337775555#10.94.2.11
"""
import pandas as pd
df = pd.read_table(StringIO(data),sep='\s+',dtype='str')
What I would like to do is replace the column content with just the phone number part of the string above. I tried the suggestions from this thread like so:
df['ANI/IP'] = df['ANI/IP'].str.replace(r'\d{10}', '').astype('str')
print(df)
However, this results in:
.....print(df)
ANI/IP
0 sip:#10.94.2.15
1 sip:#10.66.7.34
2 sip:#10.94.2.11
I need the phone numbers, so how do I achieve this instead:
ANI/IP
0 5554447777
1 6665554444
2 3337775555
The regex \d{10} matches a substring of exactly 10 digits.
df['ANI/IP'] = df['ANI/IP'].str.replace(r'\d{10}', '').astype('str')
This removes the numbers!
Note: You shouldn't do astype str (it's not needed and there is no str dtype in pandas).
You want to extract these phone numbers:
In [11]: df["ANI/IP"].str.extract(r'(\d{10})') # before overwriting!
Out[11]:
0 5554447777
1 6665554444
2 3337775555
Name: ANI/IP, dtype: object
Set this as another column and you're away:
In [12]: df["phone_number"] = df["ANI/IP"].str.extract(r'(\d{10})')
You could use pandas.core.strings.StringMethods.extract to extract it:
In [10]: df['ANI/IP'].str.extract("(\d{10})")
Out[10]:
0 5554447777
1 6665554444
2 3337775555
Name: ANI/IP, dtype: object
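For reference, a self-contained Python 3 version of the extract approach (using io.StringIO in place of the Python 2 StringIO module; expand=False is used so a single capture group returns a Series):

```python
import io
import pandas as pd

data = "ANI/IP\nsip:5554447777#10.94.2.15\nsip:10.66.7.34#6665554444\n"
df = pd.read_csv(io.StringIO(data))
# Capture the first run of exactly 10 digits in each cell
df["phone_number"] = df["ANI/IP"].str.extract(r"(\d{10})", expand=False)
print(df["phone_number"].tolist())  # ['5554447777', '6665554444']
```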
I am writing a pandas df to a csv. When I write it to a csv file, some of the elements in one of the columns are being incorrectly converted to scientific notation. For example, col_1 has strings such as '104D59' in it. The strings are mostly represented as strings in the csv file, as they should be. However, occasional strings, such as '104E59', are being converted into scientific notation (e.g., 1.04E+61) and represented as numbers in the resulting csv file.
I am trying to export the csv file into a software package (i.e., pandas -> csv -> software_new) and this change in data type is causing problems with that export.
Is there a way to write the df to a csv, ensuring that all elements in df['problem_col'] are represented as string in the resulting csv or not converted to scientific notation?
Here is the code I have used to write the pandas df to a csv:
df.to_csv('df.csv', encoding='utf-8')
I also check the dtype of the problem column:
per df.dtypes, df['problem_column'] is object
For Python 3.x (tested with Python 3.7.2) and pandas 0.23.4:
Options and Settings
For visualization of the dataframe, use pandas.set_option:
import numpy as np
import pandas as pd

# for visualisation of the float data once we read it:
pd.set_option('display.html.table_schema', True)  # to see the dataframe/table as html
pd.set_option('display.precision', 5)  # set the display precision, here 5
df = pd.DataFrame(np.random.randn(20, 4) * 10 ** -12)  # create a random dataframe
Output of the data:
df.dtypes # check datatype for columns
[output]:
0 float64
1 float64
2 float64
3 float64
dtype: object
Dataframe:
df # output of the dataframe
[output]:
0 1 2 3
0 -2.01082e-12 1.25911e-12 1.05556e-12 -5.68623e-13
1 -6.87126e-13 1.91950e-12 5.25925e-13 3.72696e-13
2 -1.48068e-12 6.34885e-14 -1.72694e-12 1.72906e-12
3 -5.78192e-14 2.08755e-13 6.80525e-13 1.49018e-12
4 -9.52408e-13 1.61118e-13 2.09459e-13 2.10940e-13
5 -2.30242e-13 -1.41352e-13 2.32575e-12 -5.08936e-13
6 1.16233e-12 6.17744e-13 1.63237e-12 1.59142e-12
7 1.76679e-13 -1.65943e-12 2.18727e-12 -8.45242e-13
8 7.66469e-13 1.29017e-13 -1.61229e-13 -3.00188e-13
9 9.61518e-13 9.71320e-13 8.36845e-14 -6.46556e-13
10 -6.28390e-13 -1.17645e-12 -3.59564e-13 8.68497e-13
11 3.12497e-13 2.00065e-13 -1.10691e-12 -2.94455e-12
12 -1.08365e-14 5.36770e-13 1.60003e-12 9.19737e-13
13 -1.85586e-13 1.27034e-12 -1.04802e-12 -3.08296e-12
14 1.67438e-12 7.40403e-14 3.28035e-13 5.64615e-14
15 -5.31804e-13 -6.68421e-13 2.68096e-13 8.37085e-13
16 -6.25984e-13 1.81094e-13 -2.68336e-13 1.15757e-12
17 7.38247e-13 -1.76528e-12 -4.72171e-13 -3.04658e-13
18 -1.06099e-12 -1.31789e-12 -2.93676e-13 -2.40465e-13
19 1.38537e-12 9.18101e-13 5.96147e-13 -2.41401e-12
And now write to_csv using the float_format='%.15f' parameter
df.to_csv('estc.csv',sep=',', float_format='%.15f') # write with precision .15
file output:
,0,1,2,3
0,-0.000000000002011,0.000000000001259,0.000000000001056,-0.000000000000569
1,-0.000000000000687,0.000000000001919,0.000000000000526,0.000000000000373
2,-0.000000000001481,0.000000000000063,-0.000000000001727,0.000000000001729
3,-0.000000000000058,0.000000000000209,0.000000000000681,0.000000000001490
4,-0.000000000000952,0.000000000000161,0.000000000000209,0.000000000000211
5,-0.000000000000230,-0.000000000000141,0.000000000002326,-0.000000000000509
6,0.000000000001162,0.000000000000618,0.000000000001632,0.000000000001591
7,0.000000000000177,-0.000000000001659,0.000000000002187,-0.000000000000845
8,0.000000000000766,0.000000000000129,-0.000000000000161,-0.000000000000300
9,0.000000000000962,0.000000000000971,0.000000000000084,-0.000000000000647
10,-0.000000000000628,-0.000000000001176,-0.000000000000360,0.000000000000868
11,0.000000000000312,0.000000000000200,-0.000000000001107,-0.000000000002945
12,-0.000000000000011,0.000000000000537,0.000000000001600,0.000000000000920
13,-0.000000000000186,0.000000000001270,-0.000000000001048,-0.000000000003083
14,0.000000000001674,0.000000000000074,0.000000000000328,0.000000000000056
15,-0.000000000000532,-0.000000000000668,0.000000000000268,0.000000000000837
16,-0.000000000000626,0.000000000000181,-0.000000000000268,0.000000000001158
17,0.000000000000738,-0.000000000001765,-0.000000000000472,-0.000000000000305
18,-0.000000000001061,-0.000000000001318,-0.000000000000294,-0.000000000000240
19,0.000000000001385,0.000000000000918,0.000000000000596,-0.000000000002414
And now write to_csv using the float_format='%f' parameter
df.to_csv('estc.csv',sep=',', float_format='%f') # this will remove the extra zeros after the '.'
For more details check pandas.DataFrame.to_csv
Use the float_format argument:
In [11]: df = pd.DataFrame(np.random.randn(3, 3) * 10 ** 12)
In [12]: df
Out[12]:
0 1 2
0 1.757189e+12 -1.083016e+12 5.812695e+11
1 7.889034e+11 5.984651e+11 2.138096e+11
2 -8.291878e+11 1.034696e+12 8.640301e+08
In [13]: print(df.to_string(float_format='{:f}'.format))
0 1 2
0 1757188536437.788086 -1083016404775.687134 581269533538.170288
1 788903446803.216797 598465111695.240601 213809584103.112457
2 -829187757358.493286 1034695767987.889160 864030095.691202
Which works similarly for to_csv:
df.to_csv('df.csv', float_format='{:f}'.format, encoding='utf-8')
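A quick hedged round-trip check that float_format keeps exponents out of the CSV text (writing to an in-memory buffer instead of a file):

```python
import io
import pandas as pd

df = pd.DataFrame({"x": [1.04e12, 5.8e11]})
buf = io.StringIO()
df.to_csv(buf, float_format="%f")
# With %f formatting, no value is written in scientific notation
print("e+" in buf.getvalue())  # False
```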
If you would like to use the values as formatted strings in a list, say as part of a csv file written with csv.writer, the numbers can be formatted before creating the list:
import csv

with open('results_actout_file', 'w', newline='') as csvfile:
    resultwriter = csv.writer(csvfile, delimiter=',')
    resultwriter.writerow(header_row_list)  # header_row_list defined elsewhere
    resultwriter.writerow(df['label'].apply(lambda x: '%.17f' % x).values.tolist())