I get the list of column names from a pandas DataFrame with columnValues = list(df.columns.values) and the row values with df.query('A == "foo"'). However, I will not need every cell value from every column. I'd like to map or zip them as key (column name): value (cell value) pairs so I can use them separately when writing output to an Excel sheet.
columnValues = list(df.columns.values)
['ColA','ColB','ColC','ColD','ColE']
rowData=df.loc[df['ColA']=='apple']
ColA ColB ColC ColD ColE
13 apple NaN height width size
I already have columnValues; if I could also get the row values as a list, I could easily use
dict(zip(columnValues, rowValues)) to create a {column name: row value} dictionary and then write the output Excel files by looking values up in it. I need this because in the output Excel file the column order and positions differ from how they are set up in the DataFrame object.
Any ideas on how I can achieve this result, even with a different approach, would be greatly appreciated.
I need a method that gets the result below:
rowValuesList=['apple', NaN, 'height','width','size']
We could do
rowValuesList = rowData.iloc[0].tolist()
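Putting the two pieces together, a minimal sketch built from the column names and row shown in the question:

import pandas as pd
import numpy as np

df = pd.DataFrame(
    [['apple', np.nan, 'height', 'width', 'size']],
    columns=['ColA', 'ColB', 'ColC', 'ColD', 'ColE'],
)

columnValues = list(df.columns.values)
rowData = df.loc[df['ColA'] == 'apple']
rowValuesList = rowData.iloc[0].tolist()

# {column name: cell value} mapping for the matched row
rowDict = dict(zip(columnValues, rowValuesList))
print(rowDict)
# {'ColA': 'apple', 'ColB': nan, 'ColC': 'height', 'ColD': 'width', 'ColE': 'size'}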
I'm new to the world of Python, so I apologize in advance if this question seems pretty rudimentary. I'm trying to pull columns of one dataframe into a separate dataframe. I want to replace the duplicate columns from the first dataframe with a single column that contains their mean values in the second dataframe. I hope this makes sense!
To provide some background, I am tracking gene expression over certain time points. I have a dataframe that is 17 rows x 33 columns. Every row in this data frame corresponds to a particular exon. Every column on this data frame corresponds to a time-point (AGE).
The dataframe looks like this:
Some of these columns contain the same name (age) and I'd like to calculate the mean of ONLY the columns with the same name, so that, for example, I get one column for "12 pcw" rather than three separate columns for "12 pcw." After which I hope to pull these values from the first dataframe into a second dataframe for averaged values.
I'm hoping to use a for loop to loop through each age (column) to get the average expression across the subjects.
I will explain my process so far below:
#1) Get list of UNIQUE string names from age list
unique_ages = set(column_names)
#2) Create an empty dataframe that gives an outline of what I want my averaged data to fit/be put in
mean_df = pd.DataFrame(index=exons, columns=unique_ages)
#3) Now I want to loop through each age to get the average expression across the donors present. This is where I'm trying to utilize a for loop to create a pipeline to process other data frames that I will be working with in the future.
for age in unique_ages:
    print(age)
    age_df = pd.DataFrame()  ## pull columns of df as separate df that have this string
    if len(age_df.columns) > 1:  ## check if df has >1 SAME column; if so, take avg across SAME columns
        mean = df.mean(axis=1)
        mean_df[age] = mean
    else:
        ## just pull out the values and put them into your temp_df
        pass
#4) Now, with my new averaged array (or the same array if multiple columns for an age are NOT present), I want to place this array into my mean_df under the appropriate column. I understand that I should use the age variable provided by the for loop to get the proper location/name of the column in my mean_df, but I'm not sure how to do this. This has all been quite a steep learning curve and I feel like it's a simple solution, but I can't seem to wrap my head around it. Any help would be greatly appreciated.
There is no need for a for loop (there often isn't with Pandas :)). You can simply use df.groupby(lambda x:x, axis=1).mean(). An example:
import pandas as pd

data = [[1, 2, 3], [4, 5, 6]]
cols = ['col1', 'col2', 'col2']
df = pd.DataFrame(data=data, columns=cols)
# col1 col2 col2
# 0 1 2 3
# 1 4 5 6
df = df.groupby(lambda x:x, axis=1).mean()
# col1 col2
# 0 1.0 2.5
# 1 4.0 5.5
The groupby function takes another function (the lambda): it is called with each column name and must return the group that column belongs to. In our case we just want the column name itself to be the group, so when the third column, named col2, is passed, it says 'this column belongs to the group named col2', which already exists (because the second column was passed earlier). You then provide the aggregation you want, in this case mean().
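If your pandas version warns that groupby(..., axis=1) is deprecated (recent 2.x releases do), a roughly equivalent sketch is to transpose, group by the index (the former column names), and transpose back:

df = pd.DataFrame(data=data, columns=cols)  # the same example frame as above
df = df.T.groupby(level=0).mean().T
#    col1  col2
# 0   1.0   2.5
# 1   4.0   5.5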
I have the following issue with a CSV in pandas. The data looks as follows:
Column A :row1: [« a », « b »; « c »
Row2 : [« d »; « e », « f »
Etc …
Note the different delimiters.
I would like it to populate the next columns based on the values in the list in each cell, like this:
ColA row 1: [a] col b:[b] colc[c]
Row 2: [d] col b:[e] colc:[f]
And so on: for however many values there are in a cell, I would like them to spread across the columns of that row.
I hope to get some insights from you and that my explanation is clear.
Thanks.
I'm struggling so far.
I can't share the data, but basically every row in column A contains a CSV-like list with separators, and for the n values within the list in a cell I would like to populate n cells across the next columns. I think I would need to split the data on the multiple delimiters, treating them as one (as you would do in Excel), and then for each row append each value of the first cell's list to the following columns? But I'm not sure how to build this...
Each 'key' of the separated-value list in a cell should go horizontally into the next column, and so on for every row in the data set; I would like to un-nest these strings.
I'm not sure I understand your I/O, but you can try this:
import pandas as pd

df = (
    pd.read_csv("test.txt", sep="[;,]", engine="python",
                header=None, skiprows=1)
      .astype(str)
      .apply(lambda x: x.str.strip("« »"))
)

# convert the numeric column labels to alphabetic letters (0 -> ColA, 1 -> ColB, ...)
df.columns = (
    df.columns.astype(str)
      .str.replace(r"(\d)",
                   lambda m: "Col" + chr(ord("@") + int(float(m.group(0))) + 1),
                   regex=True)
)
# Output:
print(df)
  ColA ColB ColC
0    a    b    c
1    d    e    f
# .txt used (reconstructed from the sample in the question):
Column A
« a », « b »; « c »
« d »; « e », « f »
Using Python I have the following:
indicators = service.getIndicators(data["temperature"])
The variables data and indicators are of type DataFrame.
In indicators I get 3 columns each with the values of one indicator.
I am adding the 3 columns to the data DataFrame, whose first column holds the temperature values:
data["InA"] = indicators[indicators.columns[0]]
data["InAB"] = indicators[indicators.columns[1]]
data["OutC"] = indicators[indicators.columns[2]]
Is there a shorter way to call getIndicators and place the result in the data DataFrame?
I feel I am using too much code just for this.
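One possibly shorter pattern is to rename the indicator columns and attach them in a single concat. A sketch (with a made-up stand-in for service.getIndicators(), since that service isn't shown):

import pandas as pd

# stand-in for indicators = service.getIndicators(data["temperature"]);
# assumed to return a DataFrame with three columns
data = pd.DataFrame({"temperature": [20.5, 21.0]})
indicators = pd.DataFrame({0: [0.1, 0.2], 1: [0.3, 0.4], 2: [0.5, 0.6]})

indicators.columns = ["InA", "InAB", "OutC"]
data = pd.concat([data, indicators], axis=1)
print(data.columns.tolist())  # ['temperature', 'InA', 'InAB', 'OutC']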
Hello, I've been struggling with this problem: I'm trying to iterate over rows, select data from them, and then assign the values to variables. This is the first time I'm using pandas and I'm not sure how to select the data.
reader = pd.read_csv(file_path, sep="\t", lineterminator='\r', usecols=[0, 1, 2, 9, 10])
for row in reader:
    print(row)
    #id_number = row[0]
    #name = row[2]
    #ip_address = row[1]
    #latitude = row[9]
and this is the output from the row that I want to assign to the variables:
050000
129.240.228.138
planetlab2.simula.no
59.93
Edit: Perhaps this is not a problem for pandas but for general Python. I am fairly new to Python, and what I'm trying to achieve is to parse a tab-separated file line by line, assign the data to variables, and print them in one loop.
this is the input file sample:
050263 128.2.211.113 planetlab-1.cmcl.cs.cmu.edu NA US Allegheny County Pittsburgh http://www.cs.cmu.edu/ Carnegie Mellon University 40.4446 -79.9427 unknown
050264 128.2.211.115 planetlab-3.cmcl.cs.cmu.edu NA US Allegheny County Pittsburgh http://www.cs.cmu.edu/ Carnegie Mellon University 40.4446 -79.9427 unknown
The general workflow you're describing is: you want to read in a csv, find a row in the file with a certain ID, and unpack all the values from that row into variables. This is simple to do with pandas.
It looks like the CSV file has at least 10 columns in it. Providing the usecols arg should filter out the columns that you're not interested in, and read_csv will ignore them when loading into the pandas DataFrame object (which you've called reader).
Steps to do what you want:
Read the data file using pd.read_csv(). You've already done this, but I recommend calling this variable df instead of reader, as read_csv returns a DataFrame object, not a Reader object. You'll also find it convenient to use the names argument to read_csv to assign column names to the dataframe. It looks like you want names=['id', 'ip_address', 'name', 'latitude','longitude'] to get those as columns. (Assuming col10 is longitude, which makes sense that 9,10 would be lat/long pairs)
Query the dataframe object for the row with that ID that you're interested in. There are a variety of ways to do this. One is using the query syntax. Hard to know why you want that specific row without more details, but you can look up more information about index lookups in pandas. Example: row = df.query("id == 50000")
Given a single row, you want to extract the row values into variables. This is easy if you've assigned column names to your dataframe, because you can treat the row as a dictionary of values, e.g. lat = row['latitude'] and lon = row['longitude'].
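Tying the three steps together, a minimal sketch (the file name and the choice of column names are assumptions based on the sample data):

import pandas as pd

df = pd.read_csv("nodes.txt", sep="\t", lineterminator="\r", header=None)
df = df.iloc[:, [0, 1, 2, 9, 10]]  # keep only the columns of interest
df.columns = ["id", "ip_address", "name", "latitude", "longitude"]

row = df.query("id == 50000").iloc[0]  # the matching row, as a Series
id_number = row["id"]
ip_address = row["ip_address"]
name = row["name"]
latitude = row["latitude"]
print(id_number, ip_address, name, latitude)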
You can use iterrows():
df = pd.read_csv(file_path, sep=',')
for index, row in df.iterrows():
    value = row['col_name']
Or if you want to access by index of the column:
df = pd.read_csv(file_path, sep=',')
for index, row in df.iterrows():
    value = row.iloc[0]
Are the values you need to add the same for each row, or does determining the value require processing each row? If it is consistent, you can apply the sum simply using pandas as a vectorized operation on the dataset. If it requires processing row by row, the above solution is the correct one for sure. If it is a table of variables that must be added row by row, you can do that by dumping them all into a column aligned with your dataset, doing the addition by row using pandas, and simply printing out the complete dataframe. Assume you have three columns to add, which you put into a new column e:
df['e'] = df.a + df.b + df.d
or, if it is a constant:
df['e'] = df.a + df.b + {constant}
Then drop the columns you don't need (e.g. df['a'] and df['b'] above).
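For that drop step, a one-liner like this should work (using the column names from the example above):

df = df.drop(columns=['a', 'b'])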
Obviously, then, if you need to calculate based on unique values for each row, put the values into another column and sum as above.
There are two DataFrames that I want to merge:
DataFrame A columns: index, userid, locale (2000 rows)
DataFrame B columns: index, userid, age (300 rows)
When I perform the following:
pd.merge(A, B, on='userid', how='outer')
I got a DataFrame with the following columns:
index, Unnamed:0, userid, locale, age
The index column and the Unnamed:0 column are identical. I guess the Unnamed:0 column is the index column of DataFrame B.
My question is: is there a way to avoid this Unnamed column when merging two DFs?
I can drop the Unnamed column afterwards, but just wondering if there is a better way to do it.
In summary, what you're doing is saving the index to file and when you're reading back from the file, the column previously saved as index is loaded as a regular column.
There are a few ways to deal with this:
Method 1
When saving a pandas.DataFrame to disk, use index=False like this:
df.to_csv(path, index=False)
Method 2
When reading from file, you can define the column that is to be used as index, like this:
df = pd.read_csv(path, index_col='index')
Method 3
If method #2 does not suit you for some reason, you can always set the column to be used as index later on, like this:
df.set_index('index', inplace=True)
After this point, your datafame should look like this:
      userid locale  age
index
0      A1092  EN-US   31
1      B9032  SV-SE   23
I hope this helps.
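As a minimal sketch reproducing the issue and Method 1 (the file name is made up):

import pandas as pd

B = pd.DataFrame({"userid": ["A1092", "B9032"], "age": [31, 23]})

B.to_csv("B.csv")  # the index is written as an unnamed first column
print(pd.read_csv("B.csv").columns.tolist())  # ['Unnamed: 0', 'userid', 'age']

B.to_csv("B.csv", index=False)  # Method 1: don't write the index
print(pd.read_csv("B.csv").columns.tolist())  # ['userid', 'age']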
Either don't write the index when saving the DataFrame to a CSV file (df.to_csv('...', index=False)), or, if you have to deal with CSV files you can't change/edit, use the usecols parameter:
A = pd.read_csv('/path/to/fileA.csv', usecols=['userid','locale'])
in order to get rid of the Unnamed:0 column ...