python alternative to scan('file', what=list(...)) in R - python

I have a file in following format:
10000
2
2
2
2
0.00
0.00
0 1
0.00
0.01
0 1
...
I want to create a dataframe from this file (skipping the first 5 lines) like this:
x1 x2 y1 y2
0.00 0.00 0 1
0.00 0.01 0 1
So the lines are converted to columns (where each third line is also split into two columns, y1 and y2).
In R I did this as follows:
df = as.data.frame(scan(".../test.txt", what=list(x1=0, x2=0, y1=0, y2=0), skip=5))
I am looking for a python alternative (pandas?) to this scan(file, what=list(...)) function.
Does it exist or do I have to write a more extended script?

You can skip the first 5 lines, then take groups of 4 to build a Python list and put that in pandas as a start... I wouldn't be surprised if pandas offered something better, though:
from itertools import islice, izip_longest

with open('input') as fin:
    # Skip header(s) at start
    after5 = islice(fin, 5, None)
    # Take remaining data and group it into groups of 4 lines each... The
    # first 2 are float data, the 3rd is two integers together, and the 4th
    # is the blank line between groups... We use izip_longest to ensure we
    # always have 4 items (padded with None if needs be)...
    for lines in izip_longest(*[iter(after5)] * 4):
        # Convert first two lines to float, and take 3rd line, split it and
        # convert to integers
        print map(float, lines[:2]) + map(int, lines[2].split())

# [0.0, 0.0, 0, 1]
# [0.0, 0.01, 0, 1]
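The snippet above is Python 2 (izip_longest and the print statement). A rough Python 3 sketch of the same grouping idea, assuming there is no blank line between records so that each record is exactly the three data lines shown in the question:
from itertools import islice

with open('input') as fin:
    after5 = islice(fin, 5, None)            # skip the 5 header lines
    for lines in zip(*[iter(after5)] * 3):   # two float lines + one "int int" line per record
        row = list(map(float, lines[:2])) + list(map(int, lines[2].split()))
        print(row)

# [0.0, 0.0, 0, 1]
# [0.0, 0.01, 0, 1]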

As far as I know there is no option at http://pandas.pydata.org/pandas-docs/stable/io.html to organize your DataFrame the way you want;
but you can achieve it easily:
lines = open('YourDataFile.txt').read() # read the whole file
import re # import re
elems = re.split('\n| ', lines)[5:] # split each element and exclude the first 5
grouped = zip(*[iter(elems)]*4) # group them 4 by 4
import pandas as pd # import pandas
df = pd.DataFrame(grouped) # construct DataFrame
df.columns = ['x1', 'x2', 'y1', 'y2'] # columns names
It's not concise, it's not elegant, but it's clear what it does...

OK, here's how I did it (it is in fact a combo of Jon's & Giupo's answers, thanks guys!):
with open('myfile.txt') as file:
    data = file.read().split()[5:]

grouped = zip(*[iter(data)]*4)

import pandas as pd
df = pd.DataFrame(grouped)
df.columns = ['x1', 'x2', 'y1', 'y2']
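One caveat with the snippet above: zip over the split file yields tuples of strings, so the resulting columns have object dtype. A minimal sketch that also converts the values to numbers, assuming (as in the sample) that the first five lines hold one value each and the number of remaining tokens is a multiple of four:
import numpy as np
import pandas as pd

with open('myfile.txt') as f:
    tokens = f.read().split()[5:]                    # flat list of whitespace-separated tokens

arr = np.array(tokens, dtype=float).reshape(-1, 4)   # 4 values per record
df = pd.DataFrame(arr, columns=['x1', 'x2', 'y1', 'y2'])
df[['y1', 'y2']] = df[['y1', 'y2']].astype(int)      # y columns back to integers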

Related

How to save values in pandas dataframe after editing some values

I have a dataframe which looks like this (it contains dummy data) -
I want to remove the text that occurs after the "_________" identifier in each of the cells. I have written the code as follows (logic: add a new column containing NaN and save the edited values in that column) -
import pandas as pd
import numpy as np
df = pd.read_excel(r'Desktop\Trial.xlsx')
NaN = np.nan
df["Body2"] = NaN
substring = "____________"
for index, row in df.iterrows():
    if substring in row["Body"]:
        split_string = row["Body"].split(substring, 1)
        row["Body2"] = split_string[0]
print(df)
But the Body2 column still displays NaN and not the edited values.
Any help would be much appreciated!
for index, row in df.iterrows():
    if substring in row["Body"]:
        split_string = row["Body"].split(substring, 1)
        # row["Body2"] = split_string[0]  # instead use the line below
        df.at[index, 'Body2'] = split_string[0]
Make use of .at to modify the value on the DataFrame itself; the row returned by iterrows() is a copy, so assigning to row["Body2"] does not write back to df.
Instead of iterating through the rows, do the operation on all rows at once. You can use expand=True to split the values into multiple columns, which I think is what you want.
substring = "____________"
df = pd.DataFrame({'Body': ['a____________b', 'c____________d', 'e____________f', 'gh']})
df[['Body1', 'Body2']] = df['Body'].str.split(substring, expand=True)
print(df)
# Body Body1 Body2
# 0 a____________b a b
# 1 c____________d c d
# 2 e____________f e f
# 3 gh gh None
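If only the text before the delimiter is needed (which is what the question asks for), a single vectorized assignment along the same lines should also work; rows without the delimiter are simply left unchanged:
df['Body2'] = df['Body'].str.split(substring, n=1).str[0]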

Calculating most frequently occurring row-specific combinations in a dataframe in Python

I have a dataframe that contains text separated by a comma
1 a,b,c,d
2 a,b,e,f
3 a,b,e,f
I am trying to produce output that prints the top 2 most common combinations of 2 letters, plus the number of occurrences across the entire dataframe. So based on the above dataframe, the output would be
(a,b,3) (e,f,2)
The combination of a and b occurs 3 times, and the combination of e and f occurs 2 times. (Yes, there are more combos that occur 2 times, but we can just cut it off here to keep it simple.) I am really stumped on how to even start this. I was thinking of looping through each row and somehow storing all combinations, and at the end printing out the top n combinations and how many times they occurred in the dataframe.
Below is what I have so far according to what I have in mind.
import pandas as pd
from io import StringIO
StringData = StringIO("""Date
a,b,c,d
a,b,e,f
a,b,e,f
""")
df = pd.read_csv(StringData, sep =";")
for index, row in df.iterrows():
    # somehow get and store all possible 2-letter combos?
You can do it this way:
import numpy as np
import pandas as pd
from io import StringIO
StringData = StringIO("""Date
a,b,c,d
a,b,e,f
a,b,e,f
""")
df = pd.read_csv(StringData, sep =";")
df['Date'] = df['Date'].apply(lambda x: x.split(','))
df['combinations'] = df['Date'].apply(lambda x: [(x[i], x[i+1]) for i in range(len(x)-1)])
df = df.explode('combinations')
df = df.groupby('combinations').agg('count').reset_index()
df.sort_values('Date', inplace=True, ascending=False)
df['combinations'] = df.values.tolist()
df.drop('Date', axis=1, inplace=True)
df['combinations'] = df['combinations'].apply(np.hstack)
print(df.iloc[:2, :])
Output:
combinations
0 [a, b, 3]
2 [b, e, 2]
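Note that the pairs built above are only the consecutive ones within each row, which is why the output shows (b, e, 2) instead of the (e, f, 2) from the question. If every unordered two-letter combination per row should be counted, a minimal sketch using itertools.combinations and collections.Counter on the same dummy data might look like this:
from collections import Counter
from itertools import combinations

import pandas as pd

df = pd.DataFrame({'Date': ['a,b,c,d', 'a,b,e,f', 'a,b,e,f']})

# Count every unordered pair of letters within each row
pair_counts = Counter(
    pair
    for row in df['Date']
    for pair in combinations(row.split(','), 2)
)

print(pair_counts.most_common(2))
# [(('a', 'b'), 3), (('a', 'e'), 2)]  -- several pairs tie at 2, so the second entry may vary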

Python - Encoding Genomic Data in dataframe

Hi, I'm trying to encode a genome, stored as a string inside a dataframe read from a CSV.
Right now I'm looking to split each string in the dataframe under the column 'Genome' into a list of its base pairs, i.e. from ('acgt...') to ('a','c','g','t'...), and then convert each base pair into a float (0.25, 0.50, 0.75, 1.00) respectively.
I thought I was looking for a split function to split each string into characters, but none seem to work on the data in the dataframe, even when it is transformed to a string using .tostring.
Here's my most recent code:
import re
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
def string_to_array(my_string):
    my_string = my_string.lower()
    my_string = re.sub('[^acgt]', 'z', my_string)
    my_array = np.array(list(my_string))
    return my_array

label_encoder = LabelEncoder()
label_encoder.fit(np.array(['a','g','c','t','z']))

def ordinal_encoder(my_array):
    integer_encoded = label_encoder.transform(my_array)
    float_encoded = integer_encoded.astype(float)
    float_encoded[float_encoded == 0] = 0.25  # A
    float_encoded[float_encoded == 1] = 0.50  # C
    float_encoded[float_encoded == 2] = 0.75  # G
    float_encoded[float_encoded == 3] = 1.00  # T
    float_encoded[float_encoded == 4] = 0.00  # anything else, z
    return float_encoded

dfpath = 'C:\\Users\\CAAVR\\Desktop\\Ison.csv'
dataframe = pd.read_csv(dfpath)
df = ordinal_encoder(string_to_array(dataframe[['Genome']].values.tostring()))
print(df)
I've tried making my own functions but I don't have any clue how they should work. Everything I try points to not being able to process the data when it's in a numpy array, and nothing is working to transform the data to another type.
Thanks for the tips!
Edit: Here is the print of the dataframe-
Antibiotic ... Genome
0 isoniazid ... ccctgacacatcacggcgcctgaccgacgagcagaagatccagctc...
1 isoniazid ... gggggtgctggcggggccggcgccgataaccccaccggcatcggcg...
2 isoniazid ... aatcacaccccgcgcgattgctagcatcctcggacacactgcacgc...
3 isoniazid ... gttgttgttgccgagattcgcaatgcccaggttgttgttgccgaga...
4 isoniazid ... ttgaccgatgaccccggttcaggcttcaccacagtgtggaacgcgg...
There are 5 columns, 'Genome' being the 5th in the list. I don't know why 1. .head() will not work and 2. print() doesn't give me all columns...
I don't think LabelEncoder is what you want. This is a simple transformation, so I recommend doing it directly. Start with a lookup table for your base-pair mapping:
lookup = {
    'a': 0.25,
    'g': 0.50,
    'c': 0.75,
    't': 1.00,
    # z: 0.00
}
Then apply the lookup to the values of the "Genome" column. The values attribute will return the result as an ndarray.
dataframe['Genome'].apply(lambda bps: pd.Series([lookup[bp] if bp in lookup else 0.0 for bp in bps.lower()])).values
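For illustration, a self-contained sketch of the same approach on made-up data (the two-row frame below is hypothetical, and all genome strings are assumed to have the same length so the per-character columns line up):
import pandas as pd

lookup = {'a': 0.25, 'g': 0.50, 'c': 0.75, 't': 1.00}   # anything else maps to 0.00

dataframe = pd.DataFrame({'Antibiotic': ['isoniazid', 'isoniazid'],
                          'Genome': ['acgt', 'ggcz']})

encoded = dataframe['Genome'].apply(
    lambda bps: pd.Series([lookup.get(bp, 0.0) for bp in bps.lower()])
).values

print(encoded)   # 2x4 ndarray: [[0.25, 0.75, 0.5, 1.0], [0.5, 0.5, 0.75, 0.0]]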

Pandas- How to save frequencies of different values in different columns line by line in a csv file (including 0 frequencies)

I have a CSV file with the following columns of interest
fields = ['column_0', 'column_1', 'column_2', 'column_3', 'column_4', 'column_5', 'column_6', 'column_7', 'column_8', 'column_9']
For each of these columns there are 153 lines of data, containing only two values: -1 or +1.
My problem is that, for each column, I would like to save the frequencies of the -1 and +1 values, comma-separated, line by line in a CSV file. Here is what happens when I try:
>>>df = pd.read_csv('data.csv', skipinitialspace=True, usecols=fields)
>>>print df['column_2'].value_counts()
1 148
-1 5
>>>df['column_2'].value_counts().to_csv('result.txt', index=False )
Then, when I open result.txt, here is what I find:
148
5
Which is obviously not what I want; I want the values on the same line of the text file, separated by a comma (e.g., 148,5).
The second problem happens when one of the frequencies is zero:
>>> print df['column_9'].value_counts()
1 153
>>> df['column_9'].value_counts().to_csv('result.txt', index=False )
Then, when I open result.txt, here is what I find:
153
I don't want that behavior either; I would like to see 153,0.
So, in summary, I would like to know how to do the following with pandas:
Given one column, save its value frequencies on the same line of a csv file, separated by commas. For example:
148,5
If there is a value with frequency 0, put that in the CSV. For example:
153,0
Append these frequency values in different lines of the same CSV file. For example:
148,5
153,0
Can I do that with pandas, or should I move to another Python lib?
Example with some dummy data:
import pandas as pd
df = pd.DataFrame({'col1': [1, 1, 1, -1, -1, -1],
                   'col2': [1, 1, 1, 1, 1, 1],
                   'col3': [-1, 1, -1, 1, -1, -1]})
counts = df.apply(pd.Series.value_counts).fillna(0).T
print(counts)
Output:
-1 1
col1 3.0 3.0
col2 0.0 6.0
col3 4.0 2.0
You can then export this to csv.
See this answer for ref:
How to get value counts for multiple columns at once in Pandas DataFrame?
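The export step itself is not shown above; a sketch of it, writing the format the question asks for (the +1 count first, then the -1 count, one line per column):
# Reorder the columns so +1 comes first, then write without header or index
counts[[1, -1]].astype(int).to_csv('result.txt', header=False, index=False)
# result.txt then contains:
# 3,3
# 6,0
# 2,4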
I believe you could do what you want like this:
import io
import pandas as pd
df = pd.DataFrame({'column_1': [1,-1,1], 'column_2': [1,1,1]})
with io.StringIO() as stream:
    # it's easier to transpose a dataframe so that the number of rows become columns
    # .to_frame to DataFrame and .T to transpose
    df['column_1'].value_counts().to_frame().T.to_csv(stream, index=False)
    print(stream.getvalue())  # check the csv data
But I would suggest something like the following, since otherwise you would have to specify that one of the expected values was missing:
with io.StringIO() as stream:
    # it's easier to transpose a dataframe so that the number of rows become columns
    # .to_frame to DataFrame and .T to transpose
    counts = df[['column_1', 'column_2']].apply(lambda column: column.value_counts())
    counts = counts.fillna(0)
    counts.T.to_csv(stream, index=False)
    print(stream.getvalue())  # check the csv data
Here is an example with three columns c1, c2, c3 and a data frame d, which is defined before the function is invoked.
import pandas as pd
import collections

def wcsv(d):
    # count the values in each column, then fill in 0 for any missing -1/+1
    dc = [dict(collections.Counter(d[i])) for i in d.columns]
    for i in dc:
        if -1 not in list(i.keys()):
            i[-1] = 0
        if 1 not in list(i.keys()):
            i[1] = 0
    w = pd.DataFrame([list(j.values()) for j in dc], columns=['1', '-1'], index=['c1', 'c2', 'c3'])
    w.to_csv("t.csv")

d = pd.DataFrame([[1, 1, -1], [-1, 1, 1], [1, 1, -1], [1, 1, -1]], columns=['c1', 'c2', 'c3'])
wcsv(d)
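A variant of the same idea, sketched here, avoids the manual key-filling by reindexing each column's counts; it assumes the columns only ever contain -1/+1, as stated in the question:
counts = d.apply(lambda col: col.value_counts().reindex([1, -1], fill_value=0)).T
counts.to_csv("t.csv", header=False)
# t.csv then contains one line per column, e.g. "c1,3,1"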

Selectively replacing DataFrames column names

I have a time series dataset in a .csv file that I want to process with Pandas (using Canopy). The column names from the file are a mix of strings and isotopic numbers.
cycles 40 38.02 35.98 P4
0 1 1.1e-8 4.4e-8 7.7e-8 8.8e-7
1 2 2.2e-8 5.5e-8 8.8e-8 8.7e-7
2 3 3.3e-8 6.6e-8 9.9e-8 8.6e-7
I would like this DataFrame to look like this
cycles 40 38 36 P4
0 1 1.1e-8 4.4e-8 7.7e-8 8.8e-7
1 2 2.2e-8 5.5e-8 8.8e-8 8.7e-7
2 3 3.3e-8 6.6e-8 9.9e-8 8.6e-7
The .csv files won't always have exactly the same column names; the numbers could be slightly different from file to file. To handle this, I've sampled the column names and rounded the values to the nearest integer. This is what my code looks like so far:
import pandas as pd
import numpy as np
df = {'cycles':[1,2,3],'40':[1.1e-8,2.2e-8,3.3e-8],'38.02':[4.4e-8,5.5e-8, 6.6e-8],'35.98':[7.7e-8,8.8e-8,9.9e-8,],'P4':[8.8e-7,8.7e-7,8.6e-7]}
df = pd.DataFrame(df, columns=['cycles', '40', '38.02', '35.98', 'P4'])
colHeaders = df.columns.values.tolist()
colHeaders[1:4] = list(map(float, colHeaders[1:4]))
colHeaders[1:4] = list(map(np.around, colHeaders[1:4]))
colHeaders[1:4] = list(map(int, colHeaders[1:4]))
colHeaders = list(map(str, colHeaders))
I tried df.rename(columns={df.loc[ 1 ]:colHeaders[ 0 ]}, ...), but I get this error:
TypeError: 'Series' objects are mutable, thus they cannot be hashed
I've read this post as well as the pandas 0.17 documentation, but I can't figure out how to use it to selectively replace the column names in a way that doesn't require me to assign new column names manually like this post.
I'm fairly new to Python and I've never posted on StackOverflow before, so any help would be greatly appreciated.
You could use a variant of your approach, but assign the new column names directly:
>>> cols = list(df.columns)
>>> cols[1:-1] = [int(round(float(x))) for x in cols[1:-1]]
>>> df.columns = cols
>>> df
cycles 40 38 36 P4
0 1 1.100000e-08 4.400000e-08 7.700000e-08 8.800000e-07
1 2 2.200000e-08 5.500000e-08 8.800000e-08 8.700000e-07
2 3 3.300000e-08 6.600000e-08 9.900000e-08 8.600000e-07
>>> df.columns
Index(['cycles', 40, 38, 36, 'P4'], dtype='object')
Or you could pass a function to rename:
>>> df = df.rename(columns=lambda x: x if x[0].isalpha() else int(round(float(x))))
>>> df.columns
Index(['cycles', 40, 38, 36, 'P4'], dtype='object')
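An equivalent way to feed rename, sketched here, builds an explicit mapping from the original string headers first (same idea, just spelled out):
# Map only the numeric-looking headers to their rounded integer form
mapping = {c: int(round(float(c))) for c in df.columns if not c[0].isalpha()}
df = df.rename(columns=mapping)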
