parsing tab delimited values from text file to variables - python

Hello I've been struggling with this problem, I'm trying to iterate over rows and select data from them and then assign them to variables. this is the first time I'm using pandas and I'm not sure how to select the data
reader = pd.read_csv(file_path, sep="\t" ,lineterminator='\r', usecols=[0,1,2,9,10],)
for row in reader:
print(row)
#id_number = row[0]
#name = row[2]
#ip_address = row[1]
#latitude = row[9]
and this is the output from the row that I want to assign to the variables:
050000
129.240.228.138
planetlab2.simula.no
59.93
Edit: Perhaps this is not a problem for pandas but for general Python. I am fairly new to python and what I'm trying to achieve is to parse tab separated file line by line and assign data to the variables and print them in one loop.
this is the input file sample:
050263 128.2.211.113 planetlab-1.cmcl.cs.cmu.edu NA US Allegheny County Pittsburgh http://www.cs.cmu.edu/ Carnegie Mellon University 40.4446 -79.9427 unknown
050264 128.2.211.115 planetlab-3.cmcl.cs.cmu.edu NA US Allegheny County Pittsburgh http://www.cs.cmu.edu/ Carnegie Mellon University 40.4446 -79.9427 unknown

The general workflow you're describing is: you want to read in a csv, find a row in the file with a certain ID, and unpack all the values from that row into variables. This is simple to do with pandas.
It looks like the CSV file has at least 10 columns in it. Providing the usecols arg should filter out the columns that you're not interested in, and read_csv will ignore them when loading into the pandas DataFrame object (which you've called reader).
Steps to do what you want:
Read the data file using pd.read_csv(). You've already done this, but I recommend calling this variable df instead of reader, as read_csv returns a DataFrame object, not a Reader object. You'll also find it convenient to use the names argument to read_csv to assign column names to the dataframe. It looks like you want names=['id', 'ip_address', 'name', 'latitude','longitude'] to get those as columns. (Assuming col10 is longitude, which makes sense that 9,10 would be lat/long pairs)
Query the dataframe object for the row with that ID that you're interested in. There are a variety of ways to do this. One is using the query syntax. Hard to know why you want that specific row without more details, but you can look up more information about index lookups in pandas. Example: row = df.query("id == 50000")
Given a single row, you want to extract the row values into variables. This is easy if you've assigned column names to your dataframe. You can treat the row as a dictionary of values. E.g. lat = row['lat'] lon = row['long]

You can use iterrows():
df = pandas.read_csv(file_path, sep=',')
for index, row in df.iterrows():
value = row['col_name']
Or if you want to access by index of the column:
df = pandas.read_csv(file_path, sep=',')
for index, row in df.iterrows():
value = row.ix[0]

Are the values you need to add the same for each row, or does it require processing the value to determine the value of the addition? If it is consistent you can apply this sum simply using pandas to do a matrix operation on the dataset. If it requires processing row by row, the above solution is the correct one for sure. If it is a table of variables that must be added row by row, you can do that by dumping them all into a column aligned with your dataset, do the addition by row using pandas, and simply print out the complete dataframe. Assume you have three columns to add, which you put into a new column[e].
df['e'] = df.a + df.b + df.d
or, if it is a constant:
df['e'] = df.a + df.b + {constant}
Then drop the columns you don't need (ex df['a'] and df['b'] in the above)
Obviously, then, if you need to calculate based on unique values for each row, put the values into another column and sum as above.

Related

MultiIndex (multilevel) column names from Dataframe rows

I have a rather messy dataframe in which I need to assign first 3 rows as a multilevel column names.
This is my dataframe and I need index 3, 4 and 5 to be my multiindex column names.
For example, 'MINERAL TOTAL' should be the level 0 until next item; 'TRATAMIENTO (ts)' should be level 1 until 'LEY Cu(%)' comes up.
What I need actually is try to emulate what pandas.read_excel does when 'header' is specified with multiple rows.
Please help!
I am trying this, but no luck at all:
pd.DataFrame(data=df.iloc[3:, :].to_numpy(), columns=tuple(df.iloc[:3, :].to_numpy(dtype='str')))
You can pass a list of row indexes to the header argument and pandas will combine them into a MultiIndex.
import pandas as pd
df = pd.read_excel('ExcelFile.xlsx', header=[0,1,2])
By default, pandas will read in the top row as the sole header row. You can pass the header argument into pandas.read_excel() that indicates how many rows are to be used as headers. This can be either an int, or list of ints. See the pandas.read_excel() documentation for more information.
As you mentioned you are unable to use pandas.read_excel(). However, if you do already have a DataFrame of the data you need, you can use pandas.MultiIndex.from_arrays(). First you would need to specify an array of the header rows which in your case would look something like:
array = [df.iloc[0].values, df.iloc[1].values, df.iloc[2].values]
df.columns = pd.MultiIndex.from_arrays(array)
The only issue here is this includes the "NaN" values in the new MultiIndex header. To get around this, you could create some function to clean and forward fill the lists that make up the array.
Although not the prettiest, nor the most efficient, this could look something like the following (off the top of my head):
def forward_fill(iterable):
return pd.Series(iterable).ffill().to_list()
zero = forward_fill(df.iloc[0].to_list())
one = forward_fill(df.iloc[1].to_list())
two = one = forward_fill(df.iloc[2].to_list())
array = [zero, one, two]
df.columns = pd.MultiIndex.from_arrays(array)
You may also wish to drop the header rows (in this case rows 0 and 1) and reindex the DataFrame.
df.drop(index=[0,1,2], inplace=True)
df.reset_index(drop=True, inplace=True)
Since columns are also indices, you can just transpose, set index levels, and transpose back.
df.T.fillna(method='ffill').set_index([3, 4, 5]).T

New column in DataFrame from other columns AND rows

I want to create a new column, V, in an existing DataFrame, df. I would like the value of the new column to be the difference between the value in the 'x' column in that row, and the value of the 'x' column in the row below it.
As an example, in the picture below, I want the value of the new column to be
93.244598 - 93.093285 = 0.151313.
I know how to create a new column based on existing columns in Pandas, but I don't know how to reference other rows using this method. Is there a way to do this that doesn't involve iterating over the rows in the dataframe? (since I have read that this is generally a bad idea)
You can use pandas.DataFrame.shift for your use case.
The last row will not have any row to subtract from so you will get the value for that cell as NaN
df['temp_x'] = df['x'].shift(-1)
df[`new_col`] = df['x'] - df['temp_x']
or one liner :
df[`new_col`] = df['x'] - df['x'].shift(-1)
the column new_col will contain the expected data
An ideal solution is to use diff:
df['new'] = df['x'].diff(-1)

Pandas - merge/join/vlookup df and delete all rows that get a match

I am trying to reference a list of expired orders from one spreadsheet(df name = data2), and vlookup them on the new orders spreadsheet (df name = data) to delete all the rows that contain expired orders. Then return a new spreadsheet(df name = results).
I am having trouble trying to mimic what I do in excel vloookup/sort/delete in pandas. Please view psuedo code/steps as code:
Import simple.xls as dataframe called 'data'
Import wo.xlsm, sheet
name "T" as dataframe called 'data2'
Do a vlookup , using Column
"A" in the "data" to be used to as the values to be
matched with any of the same values in Column "A" of "data2" (there both just Order Id's)
For all values that exist inside Column A in 'data2'
and also exist in Column "A" of the 'data',group ( if necessary) and delete the
entire row(there is 26 columns) for each matched Order ID found in Column A of both datasets. To reiterate, deleting the entire row for the matches found in the 'data' file. Save the smaller dataset as results.
import pandas as pd
data = pd.read_excel("ors_simple.xlsx", encoding = "ISO-8859-1",
dtype=object)
data2 = pd.read_excel("wos.xlsm", sheet_name = "T")
results = data.merge(data2,on='Work_Order')
writer = pd.ExcelWriter('vlookuped.xlsx', engine='xlsxwriter')
results.to_excel(writer, sheet_name='Sheet1')
writer.save()
I re-read your question and think I undertand it correctly. You want to find out if any order in new_orders (you call it data) have expired using expired_orders (you call it data2).
If you rephrase your question what you want to do is: 1) find out if a value in a column in a DataFrame is in a column in another DataFrame and then 2) drop the rows where the value exists in both.
Using pd.merge is one way to do this. But since you want to use expired_orders to filter new_orders, pd.merge seems a bit overkill.
Pandas actually has a method for doing this sort of thing and it's called isin() so let's use that! This method allows you to check if a value in one column exists in another column.
df_1['column_name'].isin(df_2['column_name'])
isin() returns a Series of True/False values that you can apply to filter your DataFrame by using it as a mask: df[bool_mask].
So how do you use this in your situation?
is_expired = new_orders['order_column'].isin(expired_orders['order_column'])
results = new_orders[~is_expired].copy() # Use copy to avoid SettingWithCopyError.
~is equal to not - so ~is_expired means that the order wasn't expired.

I have multiple columns in csv. How do I match row values to a reference column using python?

I have a csv file with 367 columns. The first column has 15 unique values, and each subsequent column has some subset of those 15 values. No unique value is ever found more than once in a column. Each column is sorted. How do I get the rows to line up? My end goal is to make a presence/absence heat map, but I need to get the data matrix in the right format first, which I am struggling with.
Here is a small example of the type of data I have:
1,2,1,2
2,3,2,5
3,4,3,
4,,5,
5,,,
I need the rows to match the reference but stay in the same column like so:
1,,1,
2,2,2,2
3,3,3,
4,4,,
5,,5,5
My thought was to use the pandas library, but I could not figure out how to approach this problem, as I am very new to using python. I am using python2.7.
So your problem is definitely solvable via pandas:
Code:
# Create the sample data into a data frame
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(u"""
1,2,1,2
2,3,2,5
3,4,3,
4,,5,
5,,,"""), header=None, skip_blank_lines=1).fillna(0)
for column in df:
df[column] = pd.to_numeric(df[column], downcast='integer')
# set the first column as an index
df = df.set_index([0])
# create a frame which we will build up
results = pd.DataFrame(index=df.index)
# add each column to the datafarme indicating if the desired value is present
for col in df.columns:
results[col] = df.index.isin(df[col])
# output the dataframe in the desired format
for idx, row in results.iterrows():
result = '%s,%s' % (idx, ','.join(str(idx) if x else ''
for x in row.values))
print(result)
Results:
1,,1,
2,2,2,2
3,3,3,
4,4,,
5,,5,5
How does it work?:
Pandas can be little daunting when first approached, even for someone who knows python well, so I will try to walk through this. And I encourage you to do what you need to get over the learning curve, because pandas is ridiculously powerful for this sort of data manipulation.
Get the data into a frame:
This first bit of code does nothing but get your sample data into a pandas.DataFrame. Your data format was not specified so I will assume, that you can get it into a frame, or if you can not get it into a frame, will ask another question here on SO about getting the data into a frame.
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(u"""
1,2,1,2
2,3,2,5
3,4,3,
4,,5,
5,,,"""), header=None, skip_blank_lines=1).fillna(0)
for column in df:
df[column] = pd.to_numeric(df[column], downcast='integer')
# set the first column as an index
df = df.set_index([0])
Build a result frame:
Start with a result frame that is just the index
# create a frame which we will build up
results = pd.DataFrame(index=df.index)
For each column in the source data, see if the value is in the index
# add each column to the dataframe indicating if the desired value is present
for col in df.columns:
results[col] = df.index.isin(df[col])
That's it, with three lines of code, we have calculated our results.
Output the results:
Now iterate through each row, which contains booleans, and output the values in the desired format (as ints)
# output the dataframe in the desired format
for idx, row in results.iterrows():
result = '%s,%s' % (idx, ','.join(str(idx) if x else ''
for x in row.values))
print(result)
This outputs the index value first, and then for each True value outputs the index again, and for False values outputs an empty string.
Postscript:
There are quite a few people here on SO who are way better at pandas than I am, but since you did not tag your question, with the pandas keyword, they likely did not notice this question. But that allows me to take my cut at answering before they notice. The pandas keyword is very well covered for well formed questions, so I am pretty sure that if this answer is not optimum, someone else will come by and improve it. So in the future, be sure to tag your question with pandas to get the best response.
Also, you mentioned that you were new python, so I will just put in a plug to make sure that you are using a good IDE. I use PyCharm, and it and other good IDEs can make working in python even more powerful, so I highly recommend them.

Returning unique values in .csv and unique strings in python+pandas

my question is very similar to here: Find unique values in a Pandas dataframe, irrespective of row or column location
I am very new to coding, so I apologize for the cringing in advance.
I have a .csv file which I open as a pandas dataframe, and would like to be able to return unique values across the entire dataframe, as well as all unique strings.
I have tried:
for row in df:
pd.unique(df.values.ravel())
This fails to iterate through rows.
The following code prints what I want:
for index, row in df.iterrows():
if isinstance(row, object):
print('%s\n%s' % (index, row))
However, trying to place these values into a previously defined set (myset = set()) fails when I hit a blank column (NoneType error):
for index, row in df.iterrows():
if isinstance(row, object):
myset.update(print('%s\n%s' % (index, row)))
I get closest to what I was when I try the following:
for index, row in df.iterrows():
if isinstance(row, object):
myset.update('%s\n%s' % (index, row))
However, my set prints out a list of characters rather than the strings/floats/values that appear on my screen when I print above.
Someone please help point out where I fail miserably at this task. Thanks!
I think the following should work for almost any dataframe. It will extract each value that is unique in the entire dataframe.
Post a comment if you encounter a problem, i'll try to solve it.
# Replace all nones / nas by spaces - so they won't bother us later
df = df.fillna('')
# Preparing a list
list_sets = []
# Iterates all columns (much faster than rows)
for col in df.columns:
# List containing all the unique values of this column
this_set = list(set(df[col].values))
# Creating a combined list
list_sets = list_sets + this_set
# Doing a set of the combined list
final_set = list(set(list_sets))
# For completion's sake, you can remove the space introduced by the fillna step
final_set.remove('')
Edit :
I think i know what happens. You must have some float columns, and fillna is failing on those, as the code i gave you was replacing missing values with an empty string. Try those :
df = df.fillna(np.nan) or
df = df.fillna(0)
For the first point, you'll need to import numpy first (import numpy as np). It must already be installed as you have pandas.

Categories