How to get the desired vertical list in Python?

How do I read a file vertically? For instance, a file would contain the following:
1234
4567
7890
to obtain [147, 258, 369, 470]
This code was used:
rows = [line.split() for line in f]
columns=zip(*rows)
print(columns)
and the following was obtained:
zip object
What should I do to fix it so that I get the desired result?

In Python 3 zip returns a zip object, useful for iterating over but not so much for printing. The fix is easy though:
print(list(columns))
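Note that with rows = [line.split() for line in f], each row is a one-element list containing the whole line, so zip(*rows) yields a single tuple of lines rather than per-character columns. A minimal sketch of the full fix (assuming the file is named 'foop.txt', as elsewhere in this thread):
with open('foop.txt') as f:
    rows = [line.strip() for line in f]          # ['1234', '4567', '7890']
columns = [''.join(col) for col in zip(*rows)]   # transpose the characters
print(columns)                                   # ['147', '258', '369', '470']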

Here is your code:
lines = "1234 4567 7890"
rows_iter = [iter(s) for s in lines.split()]
cols_as_list = zip(*rows_iter)
cols = [''.join(c) for c in cols_as_list]
(Strings are already iterable, so the iter() calls are not strictly necessary; zip(*lines.split()) works just as well.)

Given your sample data in a file called 'foop.txt', here you go:
z = zip(*(l.strip() for l in open('foop.txt')))  # strip() drops trailing newlines
columns = [''.join(x) for x in z]
print(columns)
results in:
['147', '258', '369', '470']
If you want to leave 'columns' as a generator, just change that line:
columns = (''.join(x) for x in z)

It looks like you are using Python 3, where zip returns an iterator. To see the values, you need to consume the iterator, e.g. using the list constructor:
columns = list(zip(*rows))
To obtain individual columns in individual variables, you can unpack them:
col1, col2, ... = zip(*rows)
If the file really has only a single column, there is no reason to call split in the first place. Simply read the lines into a list:
col = [int(line) for line in f] # or float(line)...
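If the desired output really is a list of ints like [147, 258, 369, 470], a sketch combining the ideas above (assuming the same three-line file):
with open('foop.txt') as f:
    rows = [line.strip() for line in f]
columns = [int(''.join(col)) for col in zip(*rows)]  # [147, 258, 369, 470]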

Related

How to convert and merge the values obtained from stats.linregress to csv in python?

From the following code, I filtered a dataset and obtained the statistic values using stats.linregress(x, y). I would like to merge the obtained lists into a table and then convert it to CSV. How do I merge the lists? I tried .append(), but it adds [...] at the end of each list. How do I write these lists to one CSV? The code below only converts the last list to CSV. Also, where is it appropriate to apply a '%.2f' format to shorten the digits? Many thanks!
for i in df.ingredient.unique():
    mask = df.ingredient == i
    x_data = df.loc[mask]["single"]
    y_data = df.loc[mask]["total"]
    ing = df.loc[mask]["ingredient"]
    res = stats.linregress(x_data, y_data)
    result_list = list(res)
    #sum_table = result_list.append(result_list)
    sum_table = result_list
    np.savetxt("sum_table.csv", sum_table, delimiter=',')
    #print(f"{i} = res")
    #print(f"{i} = result_list")
output:
[0.555725080482033, 15.369647540612188, 0.655901508882146, 0.34409849111785396, 0.45223586826559015, [...]]
[0.8240446598271236, 16.290731244189164, 0.7821893273053173, 0.00012525348188386877, 0.16409500805404134, [...]]
[0.6967783360917531, 25.8981921144781, 0.861561500951743, 0.13843849904825695, 0.29030899523536124, [...]]
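One common pattern (a sketch, assuming df with the 'ingredient', 'single', and 'total' columns from the question) is to collect one row per ingredient and write the file once, after the loop; np.savetxt's fmt parameter ('%.2f') shortens the digits:
import numpy as np
from scipy import stats

rows = []
for i in df.ingredient.unique():
    mask = df.ingredient == i
    res = stats.linregress(df.loc[mask, 'single'], df.loc[mask, 'total'])
    rows.append(list(res))  # slope, intercept, rvalue, pvalue, stderr

# one write after the loop; fmt='%.2f' rounds every value to two decimals
np.savetxt('sum_table.csv', np.array(rows), delimiter=',', fmt='%.2f')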

Explode pandas dataframe list of lists (of points) into dataframe with pairs of points

I have a pandas dataframe, bike_path_df, that contains a few columns, of which one is called coordinates. The format of coordinates is a list of lists, where each inner list is a pair of latitude and longitude coordinates, and the i'th and i+1'th element of the list of lists denote a straight line bike path segment connecting the two points i and i+1.
For example:
bike_path_df.iloc[0]['coordinates']
yields the following:
[[149.12482362501234, -35.17695800091904], # Point A of line 1
[149.12481244481404, -35.177008392939385], # Point B of line 1, point A of line 2
[149.12480556675655, -35.17703489702785], # Point B of line 2, point A of line 3
[149.12481021458206, -35.17706139012856], # etc...
[149.12483798252785, -35.17709736965295],
[149.12489568437493, -35.17714846206322]]
After some effort, I've written a clumsy loop that will allow me to pair each point with its neighbours:
all_list = []
for list_of_points in bike_path_df['coordinates']:
    result = [[list_of_points[i], list_of_points[i+1]]
              for i, j in enumerate(list_of_points) if i+1 < len(list_of_points)]
    all_list.append(result)
The output from the above resembles something like
[[[149.12482362501234, -35.17695800091904], [149.12481244481404, -35.177008392939385]],
 [[149.12481244481404, -35.177008392939385], [149.12480556675655, -35.17703489702785]],
 ...]
But converting all_list to a pd.Series object can return NaN when I try to add it back to the original dataframe (I believe because Series is expanding the list of lists, so the shapes are no longer equal).
Ideally I'd like to have each pair of points (four coordinate values) on a dataframe row, with the other data for that path repeated for each pair, such that it would resemble:
>>bike_path_df.head()
name coordinate_pair
Path1 [A1_long, A1_lat, B1_long, B1_lat]
Path1 [B1_long, B1_lat, C1_long, C1_lat]
Path1 [C1_long, C1_lat, D1_long, D1_lat]
Path1 [D1_long, D1_lat, E1_long, E1_lat]
Path2 [A2_long, A2_lat, B2_long, B2_lat]
Path2 [B2_long, B2_lat, C2_long, C2_lat]
...
Does anyone have any advice?
I've also uploaded a few rows of the actual data I'm working with in CSV format here: https://github.com/Ecaloota/BikePathInfrastructure-ACT as "bike_paths_progress.csv"
Thank you!
IIUC, use zip and explode
df = pd.read_csv('bike_paths_progress.csv', index_col=0)
df['coordinates'] = pd.eval(df['coordinates'])
df = df.join(
    df['coordinates']
    .apply(lambda x: [[i[0], i[1], j[0], j[1]] for i, j in zip(x, x[1:])])
    .explode()
    .rename('coordinate_pair')
)
Output:
>>> df.loc[81, 'coordinate_pair']
81 [149.12482362501234, -35.17695800091904, 149.12481244481404, -35.17700839293...
81 [149.12481244481404, -35.177008392939385, 149.12480556675655, -35.1770348970...
81 [149.12480556675655, -35.17703489702785, 149.12481021458206, -35.17706139012...
81 [149.12481021458206, -35.17706139012856, 149.12483798252785, -35.17709736965...
81 [149.12483798252785, -35.17709736965295, 149.12489568437493, -35.17714846206...
Name: coordinate_pair, dtype: object
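The zip(x, x[1:]) idiom used above pairs each element with its successor; a minimal illustration:
pts = ['A', 'B', 'C', 'D']
print(list(zip(pts, pts[1:])))  # [('A', 'B'), ('B', 'C'), ('C', 'D')]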

Loop through list of dataframes and save as new dataframe name

I'm trying to loop through a list of dataframes and perform operations on them. In the final command I want to rename the dataframe as the original key plus '_rand_test'. I'm getting the error:
SyntaxError: cannot assign to operator
Is there a way to do this?
segments = [main_h, main_m, main_l]
seg_name = ['main_h', 'main_m', 'main_l']
for i in segments:
    control = pd.DataFrame(i.groupby('State', group_keys=False).apply(lambda x: x.sample(frac=.1)))
    control['segment'] = 'control'
    test = i[~i.index.isin(control.index)]
    test['segment'] = 'test'
    seg_name[i]+'_rand_test' = pd.concat([control, test])
The error is because you are trying to perform addition on the left side of an = sign, which you can never do. If you want to rename the dataframe you could just do it on the next line. I'm unsure of what exactly you're trying to rename based off of the code, but if it's just the corresponding string in the seg_name list then the next line would look like this:
seg_name[segments.index(i)] += '_rand_test'
The reason for the segments.index(i) is because you're looping over the elements in segments, not their indexes, so you need to get the index of the element.
Maybe this will work for you?
Create an empty list before you run the loop and fill it with the append function. Then rename all the elements of the new list.
segments = [main_h, main_m, main_l]
seg_name = ['main_h', 'main_m', 'main_l']
new_list = []
for i in segments:
    control = pd.DataFrame(i.groupby('State', group_keys=False).apply(lambda x: x.sample(frac=.1)))
    control['segment'] = 'control'
    test = i[~i.index.isin(control.index)]
    test['segment'] = 'test'
    new_list.append(pd.concat([control, test]))
new_names_list = [name + '_rand_test' for name in seg_name]
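A dictionary keyed by the generated name is often cleaner than trying to build variable names dynamically; a sketch assuming the same main_h, main_m, main_l dataframes from the question:
import pandas as pd

segments = [main_h, main_m, main_l]
seg_name = ['main_h', 'main_m', 'main_l']

results = {}
for name, seg in zip(seg_name, segments):
    control = seg.groupby('State', group_keys=False).apply(lambda x: x.sample(frac=.1))
    control['segment'] = 'control'
    test = seg[~seg.index.isin(control.index)].copy()  # copy() avoids SettingWithCopyWarning
    test['segment'] = 'test'
    results[name + '_rand_test'] = pd.concat([control, test])

# access e.g. results['main_h_rand_test']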

Retrieve dataframe row based on list from a cell value

I am trying to retrieve a row from a pandas dataframe where the cell value is a list. I have tried isin, but it looks like it is performing OR operation, not AND operation.
>>> import pandas as pd
>>> df = pd.DataFrame([['100', 'RB', 'stacked'], [['101', '102'], 'CC', 'tagged'], ['102', 'S+C', 'tagged']],
...                   columns=['vlan_id', 'mode', 'tag_mode'], index=['dinesh', 'vj', 'mani'])
>>> df
vlan_id mode tag_mode
dinesh 100 RB stacked
vj [101, 102] CC tagged
mani 102 S+C tagged
>>> df.loc[df['vlan_id'] == '102']; # Fetching string value match
vlan_id mode tag_mode
mani 102 S+C tagged
>>> df.loc[df['vlan_id'].isin(['100','102'])]; # Fetching if contains either 100 or 102
vlan_id mode tag_mode
dinesh 100 RB stacked
mani 102 S+C tagged
>>> df.loc[df['vlan_id'] == ['101','102']]; # Fails ?
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\site-packages\pandas\core\ops.py", line 1283, in wrapper
res = na_op(values, other)
File "C:\Python27\lib\site-packages\pandas\core\ops.py", line 1143, in na_op
result = _comp_method_OBJECT_ARRAY(op, x, y)
File "C:\Python27\lib\site-packages\pandas\core\ops.py", line 1120, in _comp_method_OBJECT_ARRAY
result = libops.vec_compare(x, y, op)
File "pandas\_libs\ops.pyx", line 128, in pandas._libs.ops.vec_compare
ValueError: Arrays were different lengths: 3 vs 2
I can pull the values into a list and compare them. Instead, is there any way to check against a list value using the .loc method itself?
To find a list you can iterate over the values of vlan_id and compare each value using np.array_equal (with import numpy as np):
df.loc[[np.array_equal(x, ['101', '102']) for x in df.vlan_id.values]]
vlan_id mode tag_mode
vj [101, 102] CC tagged
However, it's advised to avoid using lists as cell values in a dataframe.
DataFrame.loc can use a list of labels or a Boolean array to access rows and columns. The list comprehension above constructs a Boolean array.
I am not sure if this is the best way to do this, or if there is a good way to do this, since as far as I know pandas doesn't really support storing lists in Series. Still:
l = ['101', '102']
df.loc[pd.concat([df['vlan_id'].str[i] == l[i] for i in range(len(l))], axis=1).all(axis=1)]
Output:
vlan_id mode tag_mode
vj [101, 102] CC tagged
Another workaround would be to transform your vlan_id columns so that it can be queried as a string. You can do that by joining your vlan_id list values into comma-separated strings.
df['proxy'] = df['vlan_id'].apply(lambda x: ','.join(x) if type(x) is list else ','.join([x]) )
l = ','.join(['101', '102'])
print(df.loc[df['proxy'] == l])
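Another option (a sketch, not from the original answers) is to compare each cell directly with plain Python ==, which sidesteps the elementwise array comparison that raised the ValueError:
target = ['101', '102']
print(df.loc[df['vlan_id'].apply(lambda v: v == target)])  # Boolean mask via per-cell comparison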

Looping at specified indexes

I have 2 large lists, each with about 100,000 elements (one larger than the other), that I want to iterate through. My loop looks like this:
for i in list1:
    for j in list2:
        function()
This current looping takes too long. However, each entry of list1 needs to be checked against list2, and beyond a certain index there are no more matching instances in list2. This means that looping by index might be faster, but the problem is I do not know how to do so.
In my project, list2 is a list of dicts that have three keys: value, name, and timestamp. list1 is a list of the timestamps in order. The function is one that takes the value based off the timestamp and puts it into a csv file in the appropriate name column.
This is an example of entries from list1:
[1364310855.004000, 1364310855.005000, 1364310855.008000]
This is what list2 looks like:
{"name":"vehicle_speed","value":2,"timestamp":1364310855.004000}
{"name":"accelerator_pedal_position","value":4,"timestamp":1364310855.004000}
{"name":"engine_speed","value":5,"timestamp":1364310855.005000}
{"name":"torque_at_transmission","value":-3,"timestamp":1364310855.008000}
{"name":"vehicle_speed","value":1,"timestamp":1364310855.008000}
In my final csv file, I should have something like this:
http://s000.tinyupload.com/?file_id=03563948671103920273
If you want this to be fast, you should restructure the data that you have in list2 in order to speedup your lookups:
# The following code converts list2 into a multivalue dictionary
from collections import defaultdict

list2_dict = defaultdict(list)
for item in list2:
    list2_dict[item['timestamp']].append((item['name'], item['value']))
This gives you a much faster way to look up your timestamps:
print(list2_dict)
defaultdict(<class 'list'>, {
    1364310855.008: [('torque_at_transmission', -3), ('vehicle_speed', 1)],
    1364310855.005: [('engine_speed', 5)],
    1364310855.004: [('vehicle_speed', 2), ('accelerator_pedal_position', 4)]})
Lookups will be much more efficient when using list2_dict:
for i in list1:
    for j in list2_dict[i]:
        # here j is a tuple in the form (name, value)
        function()
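For concreteness, a runnable sketch of the whole approach using the sample entries from the question (with function() replaced by a print for illustration):
from collections import defaultdict

list1 = [1364310855.004, 1364310855.005, 1364310855.008]
list2 = [
    {"name": "vehicle_speed", "value": 2, "timestamp": 1364310855.004},
    {"name": "accelerator_pedal_position", "value": 4, "timestamp": 1364310855.004},
    {"name": "engine_speed", "value": 5, "timestamp": 1364310855.005},
    {"name": "torque_at_transmission", "value": -3, "timestamp": 1364310855.008},
    {"name": "vehicle_speed", "value": 1, "timestamp": 1364310855.008},
]

list2_dict = defaultdict(list)
for item in list2:
    list2_dict[item['timestamp']].append((item['name'], item['value']))

for ts in list1:
    for name, value in list2_dict[ts]:  # O(1) dict lookup per timestamp
        print(ts, name, value)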
You appear to only want to use the elements in list2 that correspond to i*2 and i*2+1, that is elements 0, 1, then 2, 3, and so on.
You only need one loop.
for i in range(len(list1)):
    j = list2[i*2]
    k = list2[i*2 + 1]
    # Process function using j and k
You will only process to the end of list1.
I think the pandas module is a perfect match for your goals...
import ujson # 'ujson' (Ultra fast JSON) is faster than the standard 'json'
import pandas as pd
filter_list = [1364310855.004000, 1364310855.005000, 1364310855.008000]
def file2list(fn):
    with open(fn) as f:
        return [ujson.loads(line) for line in f]
# Use pd.read_json('data.json') instead of pd.DataFrame(file2list('data.json'))
# if you have a proper JSON file
#
# df = pd.read_json('data.json')
df = pd.DataFrame(file2list('data.json'))
# filter DataFrame with 'filter_list'
df = df[df['timestamp'].isin(filter_list)]
# convert UNIX timestamps to readable format
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='s')
# pivot data frame
# fill NaN's with zeroes
df = df.pivot(index='timestamp', columns='name', values='value').fillna(0)
# save data frame to CSV file
df.to_csv('output.csv', sep=',')
#pd.set_option('display.expand_frame_repr', False)
#print(df)
output.csv
timestamp,accelerator_pedal_position,engine_speed,torque_at_transmission,vehicle_speed
2013-03-26 15:14:15.004,4.0,0.0,0.0,2.0
2013-03-26 15:14:15.005,0.0,5.0,0.0,0.0
2013-03-26 15:14:15.008,0.0,0.0,-3.0,1.0
PS: I don't know where you got the [Latitude, Longitude] columns from, but it's pretty easy to add those columns to your result DataFrame; just add the following lines before calling df.to_csv():
df.insert(0, 'latitude', 0)
df.insert(1, 'longitude', 0)
which would result in:
timestamp,latitude,longitude,accelerator_pedal_position,engine_speed,torque_at_transmission,vehicle_speed
2013-03-26 15:14:15.004,0,0,4.0,0.0,0.0,2.0
2013-03-26 15:14:15.005,0,0,0.0,5.0,0.0,0.0
2013-03-26 15:14:15.008,0,0,0.0,0.0,-3.0,1.0
