Python Pandas Square Brackets LIST from STRING - python

Warning: I am a newbie to Python, Pandas, and PySerial...
I am reading values from an Excel spreadsheet using Pandas.
The values in Excel are stored as Text, but contain both alphabetical and numeric characters.
see Snip of Excel data
I import these using the Pandas command mydata = pd.read_excel(*path etc goes here*) (no problems are encountered with this function).
I can then print them using print(mydata), and the output looks the same as it appears in the Excel spreadsheet (i.e., there are no extra characters):
0 MW000000007150000300000;
1 MW000100009850000200000;
2 MW000200009860000200000; #<<<<<<<< *Notice that there are NO square brackets and no extra Quotes*.
To send these data via the PySerial function serial.write to my RS-232 linked device, I am looping through the values, which must (as I understand it) be in a LIST format. So, I convert the data-field mydata into a LIST, using the command Allocation_list = mydata.values.tolist()
If I print(Allocation_list), I find many square brackets and single quotes have been added, as you can see here:
Allocation_list =([['MW000000007150000300000;'], ['MW000100009850000200000;'], ['MW000200009860000200000;'], ['MW000300009870000200000;'], ['MW000400009880000200000;'], ['MW000500009890000200000;']])
These square brackets are NOT ignored when I <<serial.write>> the values in the LIST to my RS-232 device.
In fact, the values are written as (binary versions of....)
0 memory written as ['MW000000007150000300000;']
1 memory written as ['MW000100009850000200000;']
2 memory written as ['MW000200009860000200000;']
3 memory written as ['MW000300009870000200000;']
4 memory written as ['MW000400009880000200000;']
5 memory written as ['MW000500009890000200000;']
Unfortunately, for the RS-232 device to accept each of the lines written to it as an acceptable command, they must be in the precise command format for that device, which looks like
MW000000007150000300000; <<<<< the semi-colon is a required part of the syntax
So, the square brackets and the Quotation marks have to be removed, somehow.
Any help with this peculiar problem would be appreciated, as I have tried several of the methods described in other 'threads', and none of them seem to work properly because my datafield is a set of strings (which are converted to bits ONLY as they are about to be written to the RS-232 device).
M

Even if you have a frame with just one column, avoid this:
l = df.values.tolist()
l
#outputs:
[[40], [10], [20], [10], [15], [30]]
To avoid the issue, select the column when outputting to a list:
l = df['amount'].to_list()
l
#outputs:
[40, 10, 20, 10, 15, 30]
If you want a range of rows, use loc:
#put rows 3 to 5 (note the index starts at 0!) for only column 'amount' into a list
l = df.loc[2:4,'amount'].to_list()
l
#outputs:
[20, 10, 15]
Showing the code in full on a frame with only one column:
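One way this might look, as a sketch assuming a frame named df with a single 'amount' column holding the same values as the snippets above:
import pandas as pd

# a one-column frame with the values used in the examples above
df = pd.DataFrame({'amount': [40, 10, 20, 10, 15, 30]})

# values.tolist() keeps the 2-D shape, so every row becomes its own sub-list
df.values.tolist()               # [[40], [10], [20], [10], [15], [30]]

# selecting the column first gives a flat list
df['amount'].to_list()           # [40, 10, 20, 10, 15, 30]

# a row range with loc (label-based and end-inclusive)
df.loc[2:4, 'amount'].to_list()  # [20, 10, 15]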

First off, values preserves the dimensionality of the object it's called upon, so you have to target the exact column that holds the serials, something like mydata["column_label"] (just check the relevant column label by printing the dataframe).
As for quotes, pyserial write() accepts bytes-like objects, so you might need to pass an encoded version of your string, using either b'string' or 'string'.encode("utf8") notation.
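Putting both points together, a minimal sketch of the write loop; the file path, column label, port name, and baud rate below are placeholders you would adjust for your own spreadsheet and device:
import pandas as pd
import serial  # PySerial

mydata = pd.read_excel('commands.xlsx')             # placeholder path
Allocation_list = mydata['column_label'].tolist()   # flat list of strings, no nested brackets

ser = serial.Serial('COM3', 9600)                   # placeholder port and baud rate
for command in Allocation_list:
    ser.write(command.encode('utf-8'))              # write() needs bytes, so encode each string
ser.close()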

Related

Get dummy variables from a string column full of mess

I'm a less-than-a-week beginner in Python and Data sciences, so please forgive me if these questions seem obvious.
I've scraped data on a website, but the result is unfortunately not very well formatted and I can't use it without transformation.
My Data
I have a string column which contains a lot of features that I would like to convert into dummy variables.
Example of string : "8 équipements & optionsextérieur et châssisjantes aluintérieurBluetoothfermeture électrique5 placessécuritékit téléphone main libre bluetoothABSautreAPPUI TETE ARclimatisation"
What I would like to do
I would like to create a dummy column "Bluetooth" which would be equal to one if the pattern "bluetooth" is contained in the string, and zero if not.
I would like to create another dummy column "Climatisation" which would be equal to one if the pattern "climatisation" is contained in the string, and zero if not.
...etc
And do it for 5 or 6 patterns which interest me.
What I have tried
I wanted to use a match-test with regular expressions and to combine it with the pd.get_dummies method.
import re
import pandas as pd

def match(My_pattern, My_strng):
    m = re.search(My_pattern, My_strng)
    if m:
        return True
    else:
        return False

pd.get_dummies(df["My messy strings colum"], ...)
I haven't succeeded in finding how to set the pd.get_dummies arguments to specify the test I would like to apply to the column.
I was even wondering whether it's the best strategy, and whether it wouldn't be easier to create other parallel columns and apply a match.group() on my messy strings to populate them.
Not sure I would know how to program that anyway.
Thanks for your help
I think one way to do this would be:
df.loc[df['My messy strings colum'].str.contains("bluetooth", na=False),'Bluetooth'] = 1
df.loc[~(df['My messy strings colum'].str.contains("bluetooth", na=False)),'Bluetooth'] = 0
df.loc[df['My messy strings colum'].str.contains("climatisation", na=False),'Climatisation'] = 1
df.loc[~(df['My messy strings colum'].str.contains("climatisation", na=False)),'Climatisation'] = 0
The tilde (~) represents not, so the condition is reversed in this case to string does not contain.
na=False means that if your messy column contains any null values, these will not cause an error; they will just be assumed not to meet the condition.
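If you have several patterns to test, one way (a sketch, reusing the same column name as above) is to loop over them and cast the boolean result of str.contains to 0/1 directly:
patterns = {'Bluetooth': 'bluetooth', 'Climatisation': 'climatisation'}
for col, pattern in patterns.items():
    # True/False becomes 1/0 once cast to int
    df[col] = df['My messy strings colum'].str.contains(pattern, na=False).astype(int)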

Pythonic equivalent to Matlab's textscan

There are some similar questions to this, but nothing exact that I can find.
I have a very odd text-file with lines like the following:
field1=1; field2=2; field3=3;
field1=4; field2=5; field3=6;
Matlab's textscan() function deals with this very neatly, as you can do this:
array = textscan(fid, 'field1=%d; field2=%d; field3=%d;');
and you will get back a cell-array where each column contains the respective field, and the text is simply ignored.
I'd like to rewrite the code that deals with this file in Python, but Numpy's loadtxt() and genfromtxt() don't seem to have this ability to ignore text interspersed with the desired numbers?
What are some Python ways to strip out the text and only get back the fields? I'm happy to use pandas or another library if required. Thanks!
EDIT: This question was suggested as an answer, but it only gives equivalents to the basic usage of textscan that does not deal with unwanted text in the input. The answer below with fromregex is what I needed.
Numpy's fromregex function is basically the same as textscan. It lets you read in based on a regular expression, with groups (parts surrounded by ()) as the values. This works for your example:
import numpy as np

data = np.fromregex('temp.txt', r'field1=(\d+); field2=(\d+); field3=(\d+);', dtype='int')
You can also use loadtxt. There is an argument, converters, that lets you provide functions that do the actual conversion from text to a number. You just need to provide it a function that strips out the unneeded text.
So in my tests this works:
myconv = lambda x: int(x.split(b'=')[-1])
mycols = [0, 1, 2]
convdict = {i: myconv for i in mycols}
data = np.loadtxt('temp.txt', delimiter=';', usecols=mycols, converters=convdict)
myconv is an anonymous function that takes a value (say 'field1=1'), splits it on the '=' symbol (making ['field1', '1']), takes the last result ('1'), then converts that to a number (1).
mycols is just the numbers of the columns you want to keep. Since there is a delimiter at the end of each line, this counts as an extra empty column, so we exclude it.
convdict is a dictionary where each key is a column number and each value is the function to convert that column to a number. In this case they are all the same, but you can customize them however you want.
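With the two example lines from the question, this should give something like the following (loadtxt defaults to float):
data
#outputs:
array([[1., 2., 3.],
       [4., 5., 6.]])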
Python has no exact equivalent of Matlab's textscan (edit: but numpy has fromregex. See #TheBlackCat's answer for more.)
With more complicated formats, regular expressions may get the job done.
import re

line_pat = re.compile(r'field1=(\d+); field2=(\d+); field3=(\d+);')
with open(filepath, 'r') as f:
    array = [[int(n) for n in line_pat.match(line).groups()] for line in f]

Increasing Values in one Column Using CSV Module and Python

All I would like to do is add .001 to each value that isn't a 0 in one column (say column 7 for example) in my csv file.
So instead of being 35, the value would be changed to 35.001 for example. I need to do this to make my ArcMap script work because if a whole number is the first read, the column is assigned as a short integer when it needs to be read as a float.
As of right now, I have:
writer.writerow([f if f.strip() =='0' else f+.001 for f in row])
However, this creates a concatenation error and does not yet address the specific column I need it to work on.
Any help would be greatly appreciated.
Thank you.
The easiest thing to do is to just mutate the row in place, i.e.
if row[7].strip() != '0' and '.' not in row[7]:
    row[7] = row[7] + '.001'
writer.writerow(row)
The concatenation error is caused by trying to add a float to a string; you just have to wrap the extra decimals in quotes.
The extra condition on the if ensures that you don't accidentally end up with a number with two decimal points.
It's pretty standard for numbers like 35.0 to be treated as floats even though they are whole numbers - check whether ArcMap follows this convention; then you can avoid reducing the accuracy of your numbers by appending just '.0'.
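A fuller sketch of how that snippet fits into a read/modify/write loop; the file names are placeholders, and column index 7 is taken from the example above:
import csv

with open('input.csv', newline='') as fin, open('output.csv', 'w', newline='') as fout:
    reader = csv.reader(fin)
    writer = csv.writer(fout)
    for row in reader:
        # append '.001' unless the value is 0 or already contains a decimal point
        if row[7].strip() != '0' and '.' not in row[7]:
            row[7] = row[7] + '.001'
        writer.writerow(row)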

Format strings to make 'table' in Python 3

Right now I'm using print(), calling the variables I want that are stored in a tuple and then formatting them using print(format(x,"<10s") + format(y,"<40s")...), but this gives me output that isn't aligned in column form. How do I make it so that each row's elements are aligned?
So, my code is for storing student details. First, it takes a string and returns a tuple, with constituent parts like: (name,surname,student ID, year).
It reads these details from a long text file on student details, and then it parses them through a tuplelayout function (the bit which will format the tuple) and is meant to tabulate the results.
So, the argument for the tuplelayout function is a tuple, of the form:
surname | name | reg number | course | year
If you are unpacking tuples, just use a single str.format and justify the output as required using the format specification mini-language:
l = [(10, 1000), (200, 20000)]
for x, y in l:
    print("{:<3} {:<6}".format(x, y))

10  1000
200 20000
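Applied to a student tuple of the shape described in the question (the record and the column widths here are made up for illustration):
record = ('Smith', 'Anna', 'AB123456', 'Computer Science', '2')
print('{:<15s}{:<15s}{:<12s}{:<25s}{:<6s}'.format(*record))
Printing one such line per student gives aligned columns, as long as each width is at least as wide as the longest entry in that column.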
My shell has the font settings changed so the alignment was off. Back to font: "Courier" and everything is working fine.
Sorry.

how to only show int in a sorted list from csv file

I have a huge CSV file where I'm supposed to show only the columns "name" and "runtime".
My problem is that I have to sort the file and print the top 10 minimum and top 10 maximum values from the runtime column.
But the 'runtime' column contains text like this:
['http://dbpedia.org/ontology/runtime',
'XMLSchema#double',
'http://www.w3.org/2001/XMLSchema#double',
'4140.0',
'5040.0',
'5700.0',
'{5940.0|6600.0}',
'NULL',
'6480.0',....n]
How do I sort the list showing only numbers?
My code so far:
import csv
import urllib

run = []
fp = urllib.urlopen('Film.csv')
reader = csv.DictReader(fp, delimiter=',')
for line in reader:
    if line:
        run.append(line)

name = []
for row in run:
    name.append(row['name'])

runtime = []
for row in run:
    runtime.append(row['runtime'])

runtime
The CSV file contains NULL values and values looking like this: {5940.0|6600.0}
Expected output:
'4140.0',
'5040.0',
'5700.0',
'6600.0',
'6800.0',....n]
not containing the NULL values, and keeping only the highest value from the ones that look like {5940.0|6600.0}
You could filter it like this, but you should probably wait for better answers.
>>> l = [1, 1.3, 7, 'text']
>>> [i for i in l if type(i) in (type(1), type(1.0))]  # only ints and floats allowed
[1, 1.3, 7]
This should do though.
My workflow would probably be: use str.isdigit() as a filter, convert to a number with the built-in int() or float(), and then use sort() or sorted().
While you could use one of the many answers that will show up here, I personally would exploit some domain knowledge of your csv file:
runtime = runtime[3:]
Based on your example values for the runtime column, the first three entries contain metadata. So you know more about the structure of your input file than just "it is a CSV file".
Then, all you need to do is sort:
runtime = sorted(runtime)
max_10 = runtime[-10:]
min_10 = runtime[:10]
The syntax I'm using here is called a "slice", which allows you to access a range of a sequence by specifying the start index and the "up-to-but-not-including" index in square brackets, separated by a colon. Neat trick: negative indexes are counted from the end of the sequence.
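A sketch that also deals with the NULL entries and the {5940.0|6600.0} style values (keeping the highest number from those), under the same assumption that the first three entries of runtime are metadata:
def to_runtime(value):
    # drop NULLs; strip the braces, split on '|' and keep the largest number
    if value is None or value.strip().upper() == 'NULL':
        return None
    parts = value.strip().strip('{}').split('|')
    return max(float(p) for p in parts)

numbers = [to_runtime(v) for v in runtime[3:]]        # skip the three metadata entries
numbers = sorted(n for n in numbers if n is not None)
min_10 = numbers[:10]
max_10 = numbers[-10:]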
