Python list elements showing individual characters (Pandas DF from CSV)

I have a .csv with a 'cities' column. The column's values are supposed to be a list with each element being a list itself in the following format:
['City', (latitude, longitude)]
So for example:
[['Athens', (37.9839412, 23.7283052)], ['Heraklion', (35.3400127, 25.1343475)], ['Mykonos', (37.45142265, 25.392303200095327)]]
I am trying to load the csv into a pandas dataframe using pd.read_csv().
The value in the column ends up with type string and looks like this:
'[[\'Athens\', (37.9839412, 23.7283052)], [\'Heraklion\', (35.3400127, 25.1343475)], [\'Mykonos\', (37.45142265, 25.392303200095327)]]'
However, because it's a string, iterating over it just yields one character at a time.
When I do:
for i in cities:
    print(i)
Or:
list(cities)
I get:
[
[
'
A
t
h
e
n
s
'
,
(
3
7
.
9
8
3
9
4
1
2
,
2
3
.
7
2
8
3
0
5
2
)
]
,
etc.
I am looking for a way to 're-build' the data back into Python list format so that I can access the string 'Athens' with df.loc[0]['cities'][0] and the tuple (37.9839412, 23.7283052) with df.loc[0]['cities'][1].
I have tried df['cities'].astype(list) which results in the error:
TypeError: dtype '<class 'list'>' not understood

It seems the data is a string that merely looks like a Python list. You can convert it back using ast.literal_eval, applying the function to every row in the DataFrame and, if needed, storing the resulting city and coordinates as separate columns.
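A minimal sketch of that approach, assuming the column is called 'cities' as above ('cities.csv' is a placeholder path):
import ast
import pandas as pd
df = pd.read_csv('cities.csv')  # placeholder file name
# parse each stringified list back into a real Python list
df['cities'] = df['cities'].apply(ast.literal_eval)
print(df.loc[0]['cities'][0][0])  # 'Athens'
print(df.loc[0]['cities'][0][1])  # (37.9839412, 23.7283052)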

What's happening is that it's currently a string, rather than a list. Simply implement this:
import ast
l = '[[\'Athens\', (37.9839412, 23.7283052)], [\'Heraklion\', (35.3400127, 25.1343475)], [\'Mykonos\', (37.45142265, 25.392303200095327)]]'
res = ast.literal_eval(l)
print(res)
print(type(res))
Output:
[['Athens', (37.9839412, 23.7283052)], ['Heraklion', (35.3400127, 25.1343475)], ['Mykonos', (37.45142265, 25.392303200095327)]]
<class 'list'>

Related

How do I convert binary data to hexadecimal through Python?

So I'm trying to get the binary data field in a database as a hexadecimal string. I don't know if that is exactly what I'm looking for, but that is my hunch. There is a column called status in my dataframe that shows as [binary data] in PostgreSQL, and it still looks like that when I execute the following command:
df = pd.read_sql_query("""SELECT * FROM public."Vehtek_SLG" LIMIT 1000""",con=engine.connect())
How do I get the actual data in that column?
It looks like your DataFrame stores, for each row, a list of individual bytes instead of one whole bytes string. The Series df["status"].map(b"".join) will hold the concatenated bytes strings.
import random
import pandas as pd
# Simulating lists of 10 bytes for each row
df = pd.DataFrame({
    "status": [
        [bytes([random.randint(0, 255)]) for _ in range(10)]
        for _ in range(5)
    ]
})
s = df["status"].map(b"".join)
Both objects look like:
# df
status
0 [b'\xb3', b'f', b';', b'P', b'\xcb', b'\x9b', ...
1 [b'\xd2', b'\xe8', b'.', b'b', b'g', b'|', b'\...
2 [b'\xa7', b'\xe1', b'z', b'-', b'W', b'\xb8', ...
3 [b'\xc5', b'\xa9', b'\xd5', b'\xde', b'\x1d', ...
4 [b'\xa3', b'b', b')', b'\xe3', b'5', b'`', b'\...
# s
0 b'\xb3f;P\xcb\x9bi\xb0\x9e\xfd'
1 b'\xd2\xe8.bg|\x94O\x90\n'
2 b'\xa7\xe1z-W\xb8\xc2\x84\xb91'
3 b'\xc5\xa9\xd5\xde\x1d\x02*}I\x15'
4 b'\xa3b)\xe35`\x0ed#g'
Name: status, dtype: object
After converting the status field to a single bytes object per row, we can then use the following to make it hexadecimal.
df['status'] = s.apply(bytes.hex)
And now here is your field!
df['status'].head()
0 1f8b0800000000000400c554cd8ed33010beafb4ef6045...
1 1f8b0800000000000400c554cd8ed33010beafb4ef6045...
2 1f8b0800000000000400c554cd6e9b4010be47ca3bac50...
3 1f8b0800000000000400c554cd6e9b4010be47ca3bac50...
4 1f8b0800000000000400c554cd6e9b4010be47ca3bac50...
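The two steps can also be chained in one pass; a minimal sketch, assuming df["status"] still holds the lists of single-byte objects shown above:
df['status'] = df['status'].map(b''.join).apply(bytes.hex)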

How to unlist a list in dataframe column?

I have a dataframe column codes as below:
codes
-----
[K70, X090a2, T8a981,X090a2]
[A70, X90a2, T8a91,A70,A70]
[B70, X09a2, T8a81]
[C70, X00a2, T8981,X00a2,C70]
I want output like this in a dataframe. I need to check for duplicates, return only the unique values, and then unlist. I used dict.fromkeys(z1['codes']) because keys don't have duplicates, and I tried a for loop with counts, but didn't get the expected results.
output column:
codes
-----
K70 X090a2 T8a981
A70 X90a2 T8a91
B70 X09a2 T8a81
C70 X00a2 T8981
If the column contains lists, deduplicate with dict.fromkeys (which preserves order) and then join with whitespace:
#if the values are strings, first split them into lists:
#z1['codes'] = z1['codes'].str.strip('[]').str.split(r',\s*')
z1['codes'] = z1['codes'].apply(lambda x: ' '.join(dict.fromkeys(x).keys()))
print (z1)
codes
0 K70 X090a2 T8a981
1 A70 X90a2 T8a91
2 B70 X09a2 T8a81
3 C70 X00a2 T8981
set will remove duplicates from a list, and join will unlist the list into a whitespace-separated string. Note that a set does not preserve the original order, so the elements may come out in a different order than shown:
z1['codes'].apply(lambda code: " ".join(set(code)))
print (z1)
codes
0 K70 X090a2 T8a981
1 A70 X90a2 T8a91
2 B70 X09a2 T8a81
3 C70 X00a2 T8981
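If the codes column came out of a CSV and holds plain strings rather than lists, the commented-out split from the first answer can be combined with the order-preserving dedup; a minimal sketch under that assumption:
import pandas as pd
z1 = pd.DataFrame({'codes': ['[K70, X090a2, T8a981,X090a2]',
                             '[A70, X90a2, T8a91,A70,A70]']})
# strip the brackets, split on commas, then dedup while keeping order
z1['codes'] = (z1['codes'].str.strip('[]')
                          .str.split(r',\s*')
                          .apply(lambda x: ' '.join(dict.fromkeys(x))))
print(z1)
#                codes
# 0  K70 X090a2 T8a981
# 1    A70 X90a2 T8a91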

pandas: how to select all rows of a dataframe that meet a condition (ValueError: "Arrays were different lengths")

Python 2.7
I have a Dataframe with two columns, coordinates and loc. coordinates contains 10 lat/long pairs and loc contains 10 strings.
The following code leads to a ValueError, arrays were different lengths. Seems like I'm not writing the condition correctly.
lst_10_cords = [['37.09024, -95.712891'], ['-37.605, 145.146'], ['43.0481962, -76.0488458'], ['29.7604267, -95.3698028'], ['47.6062095, -122.3320708'], ['34.0232431, -84.3615555'], ['31.9685988, -99.9018131'], ['37.226582, -95.70522299999999'], ['40.289918, -83.036372'], ['37.226582, -95.70522299999999']]
lst_10_locs = [['United States'], ['Doreen, Melbourne'], ['Upstate NY'], ['Houston, TX'], ['Seattle, WA'], ['Roswell, GA'], ['Texas'], ['null'], ['??, passing by...'], ['null']]
df = pd.DataFrame(columns=['coordinates', 'locs'])
df['coordinates'] = lst_10_cords
df['locs'] = lst_10_locs
print df
df = df[df['coordinates'] != ['37.226582', '-95.70522299999999']] #ValueError
The error message is
File "C:\Users...\Miniconda3\envs\py2.7\lib\site-packages\pandas\core\ops.py", lin
e 1283, in wrapper
res = na_op(values, other)
File "C:\Users...\Miniconda3\envs\py2.7\lib\site-packages\pandas\core\ops.py", lin
e 1143, in na_op
result = _comp_method_OBJECT_ARRAY(op, x, y)
File "C:...\biney\Miniconda3\envs\py2.7\lib\site-packages\pandas\core\ops.py", lin
e 1120, in _comp_method_OBJECT_ARRAY
result = libops.vec_compare(x, y, op)
File "pandas/_libs/ops.pyx", line 128, in pandas._libs.ops.vec_compare
ValueError: Arrays were different lengths: 10 vs 2
My goal here is to actually check for and eliminate all entries in the coordinates column that are equal to the list [37.226582, -95.70522299999999], so I want df['coordinates'] to print out [['37.09024, -95.712891'], ['-37.605, 145.146'], ['43.0481962, -76.0488458'], ['29.7604267, -95.3698028'], ['47.6062095, -122.3320708'], ['34.0232431, -84.3615555'], ['31.9685988, -99.9018131'], ['40.289918, -83.036372']]
I was hoping that this documentation would help, particularly the part that shows:
"You may select rows from a DataFrame using a boolean vector the same length as the DataFrame’s index (for example, something derived from one of the columns of the DataFrame):"
df[df['A'] > 0]
so it seems like I'm not quite getting the syntax right... But I'm stuck. How do I set a condition on the cell value of a certain column and return a dataframe containing only the rows whose cells meet that condition?
Consider this:
df
coordinates locs
0 [37.09024, -95.712891] [United States]
1 [-37.605, 145.146] [Doreen, Melbourne]
2 [43.0481962, -76.0488458] [Upstate NY]
3 [29.7604267, -95.3698028] [Houston, TX]
4 [47.6062095, -122.3320708] [Seattle, WA]
5 [34.0232431, -84.3615555] [Roswell, GA]
6 [31.9685988, -99.9018131] [Texas]
7 [37.226582, -95.705222999] [null]
8 [40.289918, -83.036372] [??, passing by...]
9 [37.226582, -95.7052229999] [null]
# note: np.float was removed in NumPy 1.24; the builtin float behaves the same here
df['lat'] = df['coordinates'].map(lambda x: float(x[0].split(",")[0]))
df['lon'] = df['coordinates'].map(lambda x: float(x[0].split(",")[1]))
df[~((np.isclose(df['lat'],37.226582)) & (np.isclose(df['lon'],-95.70522299999999)))]
coordinates locs lat lon
0 [37.09024, -95.712891] [United States] 37.090240 -95.712891
1 [-37.605, 145.146] [Doreen, Melbourne] -37.605000 145.146000
2 [43.0481962, -76.0488458] [Upstate NY] 43.048196 -76.048846
3 [29.7604267, -95.3698028] [Houston, TX] 29.760427 -95.369803
4 [47.6062095, -122.3320708] [Seattle, WA] 47.606209 -122.332071
5 [34.0232431, -84.3615555] [Roswell, GA] 34.023243 -84.361555
6 [31.9685988, -99.9018131] [Texas] 31.968599 -99.901813
8 [40.289918, -83.036372] [??, passing by...] 40.289918 -83.036372
One issue: if you look into the objects your dataframe is storing, you see that each coords entry is a single string. The error you are getting seems to come from comparing the 10-element .coordinates Series with a 2-element list, which is an obvious length mismatch. Using .values seemed to get around that.
df2 = pd.DataFrame([row if row[0]!= ['37.226582, -95.70522299999999'] else [np.nan, np.nan] for row in df.values ], columns=['coords', 'locs']).dropna()
OK, here is an approach to ensure you have clean data to operate on.
Let's assume 4 entries with a dirty coordinate entry.
lst_4_cords = [['37.09024, -95.712891'], ['-37.605, 145.146'], ['43.0481962, -76.0488458'], ['null']]
lst_4_locs = [['United States'], ['Doreen, Melbourne'], ['Upstate NY'], ['Houston, TX']]
df = pd.DataFrame(columns=['coordinates', 'locs'])
df['coordinates'] = lst_4_cords
df['locs'] = lst_4_locs
coordinates locs
0 [37.09024, -95.712891] [United States]
1 [-37.605, 145.146] [Doreen, Melbourne]
2 [43.0481962, -76.0488458] [Upstate NY]
3 [null] [Houston, TX]
Now we make a cleaning method. You would really want to test the values properly:
- type(value) is list
- type(value[0]) is str
- value[0].split(",") has two elements
- each element can be cast to float, etc.
- each is valid as a lat or a lon
However, we will do it the dirty way using a try/except.
def scrubber_drainer(value):
    try:
        # we assume value is a list whose single element is a string of two
        # comma-separated numbers, which we split into a tuple of two floats
        return tuple([float(value[0].split(",")[0]), float(value[0].split(",")[1])])
    except Exception:
        # return (38.9072, 77.0396)  # swamp
        return (0.0, 0.0)  # some default
So the return is typically a tuple of two floats. If the value can't become that, we return a default (0.0, 0.0).
now update the coordinates
df['coordinates'] = df['coordinates'].map(scrubber_drainer)
then we use this cool technique to split out the tuple
df[['lat', 'lon']] = df['coordinates'].apply(pd.Series)
and now you can use np.isclose() to filter:
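A minimal sketch of that final filter, reusing the lat/lon columns created above (the target values come from the question):
import numpy as np
mask = ~(np.isclose(df['lat'], 37.226582) & np.isclose(df['lon'], -95.70522299999999))
df_clean = df[mask]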

Convert lat/long coordinates in a pandas Series to list of lists

I have a column in pandas called 'coords'. It has multiple comma delimited longitude + 'space' + latitude values in each row.
A sample row for the 'coords' column would appear like below...
[-88.12166374975578 42.13019789209025, -88.12166297898594 42.130077282796826, -88.12166229779616 42.12997073740438, -88.12165682902426 42.129114208546525, -88.12165440666122 42.12867029753218]
I would like to create a list of lists from each row, so that it would appear like this...
[[-88.12166374975578, 42.13019789209025], [-88.12166297898594, 42.130077282796826], [-88.12166229779616, 42.12997073740438], [-88.12165682902426, 42.129114208546525], [-88.12165440666122, 42.12867029753218]]
How can I convert df['coords'] to the list of lists?
Here is a head()...
coords
0 -88.12166374975578 42.13019789209025, -88.12166297898594 42.130077282796826, -88.12166229779616 42.12997073740438, -88.12165682902426 42.129114208546525, -88.12165440666122 42.12867029753218, -88.12165409167278 42.12861210461891, -88.12165078955562 42.1280072560737, -88.1216505237599 42.127958648542936, -88.12164976861018 42.127820070569165, -88.12164950156834 42.127770730347784, -88.12164936198349 42.127745113495685, -88.12164631909246 42.12698047923614, -88.12164465148149 42.126561239318384, -88.12164441208937 42.126501380826646, -88.12165535387125 42.125918676152615, -88.12165901489989 42.1257236125411, -88.12166910482216 42.125179681003004, -88.12167046792653 42.12511347549821, -88.12168153859359 42.124574951678966, -88.12169213266428 42.12405994975595, -88.12169609920953 42.123867...
1 -88.15806483536268 42.15423929791892, -88.15734814434225 42.15424023425998, -88.15692561771552 42.15424078182948, -88.15612280604331 42.15424182229812, -88.15570230201315 42.154247060953836, -88.15537304882349 42.15424548051985, -88.15424894139665 42.15424008174756, -88.15312432528388 42.15423466567452, -88.15200516375596 42.15422926640768, -88.15075402101326 42.1542232181898, -88.15074137162432 42.15422315689777, -88.15073738857417 42.15384470168878, -88.1507388608806 42.15329655518857, -88.15074017125366 42.15246856985761, -88.15074053615406 42.15224538180373, -88.15074152744889 42.151633597914206, -88.15074252669456 42.15055197422978, -88.15074334980639 42.15033614385567, -88.15074448165737 42.15003982848825, -88.15074567060333 42.14972749019171, -88.15074611950101 42.14952766024307...
Assuming what you showed is an excerpt of the Coords column, you can use pd.Series.str.split:
coords = df.Coords
print(coords)
0 -88.12166374975578 42.13019789209025
1 -88.12166297898594 42.130077282796826
2 -88.12166229779616 42.12997073740438
3 -88.12165682902426 42.129114208546525
4 -88.12165440666122 42.12867029753218
dtype: object
list_ = coords.str.split(expand=True).applymap(float).values.tolist()
print(list_)
[[-88.12166374975578, 42.13019789209025],
[-88.12166297898594, 42.130077282796826],
[-88.12166229779616, 42.12997073740438],
[-88.12165682902426, 42.129114208546525],
[-88.12165440666122, 42.12867029753218]]
Edited solution:
print(coords)
coords
0 -88.12166374975578 42.13019789209025, -88.1216...
1 -88.15806483536268 42.15423929791892, -88.1573...
out = df.coords.str.split(r',\s+').apply(pd.Series).stack()\
    .str.split(expand=True).applymap(float).values.tolist()
print(out)
[[-88.12166374975578, 42.13019789209025],
[-88.12166297898594, 42.130077282796826],
[-88.12166229779616, 42.12997073740438],
[-88.12165682902426, 42.129114208546525],
[-88.12165440666122, 42.12867029753218],
[-88.12165409167278, 42.12861210461891],
[-88.12165078955562, 42.1280072560737],
[-88.1216505237599, 42.127958648542936],
[-88.12164976861018, 42.127820070569165],
[-88.12164950156834, 42.127770730347784],
[-88.12164936198349, 42.127745113495685],
[-88.12164631909246, 42.12698047923614],
[-88.12164465148149, 42.126561239318384],
[-88.12164441208937, 42.126501380826646],
[-88.12165535387125, 42.125918676152615],
[-88.12165901489989, 42.1257236125411],
[-88.12166910482216, 42.125179681003004],
[-88.12167046792653, 42.12511347549821],
[-88.12168153859359, 42.124574951678966],
[-88.12169213266428, 42.12405994975595],
[-88.12169609920953, 42.123867],
[-88.15806483536268, 42.15423929791892],
[-88.15734814434225, 42.15424023425998],
[-88.15692561771552, 42.15424078182948],
[-88.15612280604331, 42.15424182229812],
[-88.15570230201315, 42.154247060953836],
[-88.15537304882349, 42.15424548051985],
[-88.15424894139665, 42.15424008174756],
[-88.15312432528388, 42.15423466567452],
[-88.15200516375596, 42.15422926640768],
[-88.15075402101326, 42.1542232181898],
[-88.15074137162432, 42.15422315689777],
[-88.15073738857417, 42.15384470168878],
[-88.1507388608806, 42.15329655518857],
[-88.15074017125366, 42.15246856985761],
[-88.15074053615406, 42.15224538180373],
[-88.15074152744889, 42.151633597914206],
[-88.15074252669456, 42.15055197422978],
[-88.15074334980639, 42.15033614385567],
[-88.15074448165737, 42.15003982848825],
[-88.15074567060333, 42.14972749019171],
[-88.15074611950101, 42.14952766024307]]
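If the goal is one list of lists per row rather than a single flattened list, the same split can be done row by row; a minimal sketch, assuming each cell of df['coords'] is a comma-separated string like the head() above:
df['coords_list'] = df['coords'].apply(
    lambda s: [[float(v) for v in pair.split()] for pair in s.split(',')]
)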

How to extract content from the regex output which has square bracket in python

I have a Python (2.7) pandas DataFrame with a column which looks something like this:
email
['jsaw#yahoo.com']
['jfsjhj#yahoo.com']
['jwrk#yahoo.com']
['rankw#yahoo.com']
I want to extract the email from it without the square brackets and single quotes. Output should look like this:
email
jsaw#yahoo.com
jfsjhj#yahoo.com
jwrk#yahoo.com
rankw#yahoo.com
I have tried the suggestions from this answer: Replace all occurrences of a string in a pandas dataframe (Python). But it's not working. Any help will be appreciated.
Edit:
What if I have arrays with more than one element? Something like:
email
['jsaw#yahoo.com']
['jfsjhj#yahoo.com']
['jwrk#yahoo.com']
['rankw#yahoo.com','fsffsnl#gmail.com']
['mklcu#yahoo.com','riserk#gmail.com', 'funkdl#yahoo.com']
Is it possible to get the output in three different columns without square brackets and single quotes?
You can use str.strip if the type of the values is string:
print type(df.at[0,'email'])
<type 'str'>
df['email'] = df.email.str.strip("[]'")
print df
email
0 jsaw#yahoo.com
1 jfsjhj#yahoo.com
2 jwrk#yahoo.com
3 rankw#yahoo.com
If the type is list, apply pd.Series:
print type(df.at[0,'email'])
<type 'list'>
df['email'] = df.email.apply(pd.Series)
print df
email
0 jsaw#yahoo.com
1 jfsjhj#yahoo.com
2 jwrk#yahoo.com
3 rankw#yahoo.com
EDIT: If you have multiple values in the arrays, you can use:
df1 = df['email'].apply(pd.Series).fillna('')
print df1
0 1 2
0 jsaw#yahoo.com
1 jfsjhj#yahoo.com
2 jwrk#yahoo.com
3 rankw#yahoo.com fsffsnl#gmail.com
4 mklcu#yahoo.com riserk#gmail.com funkdl#yahoo.com
Try this one:
from re import findall
s = "['rankw#yahoo.com']"
m = findall(r"\[([A-Za-z0-9#'._]+)\]", s)
print(m[0].replace("'",''))
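To apply the same idea across the whole column, including the multi-address rows from the edit, pandas' str.findall can feed the column expansion shown in the first answer; a minimal sketch (the character class mirrors the sample data, which uses # in place of @):
import pandas as pd
df = pd.DataFrame({'email': ["['jsaw#yahoo.com']",
                             "['rankw#yahoo.com','fsffsnl#gmail.com']"]})
out = df['email'].str.findall(r"[A-Za-z0-9#._]+").apply(pd.Series).fillna('')
print(out)
#                  0                  1
# 0   jsaw#yahoo.com
# 1  rankw#yahoo.com  fsffsnl#gmail.com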
