Processing pandas dataframe with variable delimiter and row length - python

I have a tedious csv file with the following format
HELLO 1000 db1 3.88
HELLO 10 db123456 3.8899949
HELLO repository 10.0000
HELLO rep 001 0.001
Basically, the first field is always the same constant, while the names vary in length and use different separators
(for example, "1000 db1"), and the final values are all floats, again in different formats/lengths.
What I would like is to be able to read columns as
constant name value
HELLO ..... ....
I have looked for a solution but can't figure one out. Initially, I was trying
df.map(lambda x: x[...])
to cut off the last values, but it does not work because the last values do not always have the same length.
Thanks in advance

I suppose you want to split the CSV into three columns. You can use the re module for the task (if file.csv is in the format you describe in your question):
import re
import pandas as pd

with open('file.csv', 'r') as f_in:
    # one capture group per column: constant, name (may contain spaces), value
    df = pd.DataFrame(re.findall(r'([^\s]+)\s(.*)\s(.+)', f_in.read()),
                      columns=['constant', 'name', 'value'])
print(df)
Prints:
  constant         name      value
0    HELLO     1000 db1       3.88
1    HELLO  10 db123456  3.8899949
2    HELLO   repository    10.0000
3    HELLO      rep 001      0.001
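If you would rather stay inside pandas, here is a minimal sketch of the same idea (assuming file.csv has exactly the layout shown above; the column names are just illustrative): read each line whole, then split it with str.extract using the same kind of regex:
import pandas as pd

# the lines contain no commas, so read_csv puts each whole line into one column
raw = pd.read_csv('file.csv', header=None, names=['raw'])
# the greedy middle group keeps the multi-word name together; the last group grabs the value
df = raw['raw'].str.extract(r'(?P<constant>\S+)\s+(?P<name>.+)\s+(?P<value>\S+)')
df['value'] = df['value'].astype(float)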

Related

Reading txt file (similar to dictionary format) into pandas dataframe

I have a txt file that looks like this:
('GTCC', 'ACTB'): 1
('GTCC', 'GAPDH'): 2
('CGAG', 'ACTB'): 1
('CGAG', 'GAPDH'): 4
where the first string is a gRNA name, the second string is a gene name, and the number is a count of those two strings occurring together.
I want to read this into a pandas dataframe and re-shape it so that it looks like this:
ACTB GAPDH
GTCC 1 2
CGAG 1 4
How might I do this?
The file will not always be this size-- it will often be much larger (200 gRNA names x 20 gene names) but the size will be variable. There will always only be one gRNA name and one gene name per count. The titles of the columns/rows are accurate as to what the real file will look like (some string of letters for the rows and some gene name for the columns).
This is certainly not the cleanest way to do it, but I figured out a way to get what I wanted:
import pandas as pd

# split on "," or ":" so the three fields land in separate columns
df = pd.read_csv('test.txt', sep=",|:", engine='python', names=['gRNA', 'gene', 'count'])
# strip the leftover parentheses and quotes (literal replacements, not regex)
df["gRNA"] = df["gRNA"].str.replace("(", "", regex=False)
df["gRNA"] = df["gRNA"].str.replace("'", "", regex=False)
df["gene"] = df["gene"].str.replace(")", "", regex=False)
df["gene"] = df["gene"].str.replace("'", "", regex=False)
# reshape so gRNA names become the rows and gene names the columns
df = df.pivot(index='gRNA', columns='gene', values='count')
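For what it's worth, a slightly cleaner sketch of the same idea (assuming every line really looks like ('GTCC', 'ACTB'): 1) pulls the three fields out with a single regular expression and then pivots exactly as above:
import re
import pandas as pd

rows = []
with open('test.txt') as fh:
    for line in fh:
        # one capture per field: gRNA name, gene name, count
        m = re.match(r"\('(\w+)',\s*'(\w+)'\):\s*(\d+)", line)
        if m:
            rows.append((m.group(1), m.group(2), int(m.group(3))))

df = pd.DataFrame(rows, columns=['gRNA', 'gene', 'count'])
df = df.pivot(index='gRNA', columns='gene', values='count')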

Creating new dataframe with .txt file using Pandas

I have a text file with data displayed like this:
{"created_at":"Mon Jun 02 00:04:00 +0000 2018","id":870430762953920,"id_str":"87043076220","text":"Hello there","source":"\u003ca href=\"http:\/\/tapbots.com\/software\/tweetbot\/mac\" rel=\"nofollow\"\u003eTweetbot for Mac\u003c\/a\u003e","truncated":false,"in_reply_to_status_id"}
The data is twitter posts and I have hundreds of these in one text file. I want to get the key/value pair of "text":"Hello there" and turn that into its own dataframe with a third column named target. I don't need any of the other columns. I'm doing some sensitivity analysis.
What would be the most pythonic way to go about this? I thought about using the
df = pd.read_csv('test.txt', sep=r'"'), but then I don't know how to get rid of all the other columns I don't need and select the column with the text in it.
Any help would be much appreciated!
I had to modify the last two key/value pairs in your data to make it work. You may want to check whether you're getting the data correctly, or whether you copied and pasted it properly, because you should be getting errors with the data as displayed in your post.
"truncated":False,"in_reply_to_status_id":1
Then this worked well for me:
import pandas as pd

with open('test.txt', 'r') as inf1:
    d = eval(inf1.read())  # reads the text file as code to evaluate

index = range(len(d))
df = pd.DataFrame(d, index=index)  # an index is needed because the values are all scalars
df = df.pop('text')                # keep only the 'text' column (as a Series)
print(df)
Returns
0 Hello there
1 Hello there
2 Hello there
3 Hello there
4 Hello there
5 Hello there
6 Hello there
Name: text, dtype: object
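An alternative sketch that avoids eval (assuming each line of test.txt is one complete JSON object, i.e. the last key/value pair is not actually truncated in the real file) is to parse line by line with the json module and keep only the text field, plus the extra target column you mentioned:
import json
import pandas as pd

texts = []
with open('test.txt') as fh:
    for line in fh:
        line = line.strip()
        if line:
            # each line is expected to be one JSON object; keep only its 'text' value
            texts.append(json.loads(line)['text'])

df = pd.DataFrame({'text': texts})
df['target'] = None  # placeholder third column for the sensitivity analysis labels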

Pandas apply Series- Order of the columns

To aggregate and to find values per second, I am doing the following in Python using pandas; however, the output logged to a file doesn't show the columns in the order they appear here. Somehow the column names get sorted, and hence TotalDMLsSec shows up before UpdateTotal and UpdatesSec.
'DeletesTotal': x['Delete'].sum(),
'DeletesSec': x['Delete'].sum()/VSeconds,
'SelectsTotal': x['Select'].sum(),
'SelectsSec': x['Select'].sum()/VSeconds,
'UpdateTotal': x['Update'].sum(),
'UpdatesSec': x['Update'].sum()/VSeconds,
'InsertsTotal': x['Insert'].sum(),
'InsertsSec': x['Insert'].sum()/VSeconds,
'TotalDMLsSec':(x['Delete'].sum()+x['Update'].sum()+x['Insert'].sum())/VSeconds
})
)
df.to_csv('/home/summary.log', sep='\t', encoding='utf-8-sig')
Apart from the above, I have a couple of other questions:
Despite logging in CSV format, all values/columns appear in one column in Excel; is there any way to load the CSV properly?
Can rows be sorted on one column (let's say InsertsSec) by default when writing to the csv file?
Any help here would be really appreciated.
Assume that your DataFrame is something like this:
      Deletes  Selects  Updates  Inserts
Name
Xxx        20       10       40       50
Yyy        12       32       24       11
Zzz        70       20       30       20
Then both total and total per sec can be computed as:
total = df.sum().rename('Total')
VSeconds = 5 # I assumed some value
tps = (total / VSeconds).rename('Total per sec')
Then you can add both above rows to the DataFrame:
df = df.append(total).append(tps)  # note: DataFrame.append was removed in pandas 2.0; pd.concat is the modern replacement
The downside is that all numbers are converted to float, but in pandas there is no other way, as each column must hold values of a single type.
Then you can e.g. write it to a CSV file (with totals included).
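As for the two side questions, a minimal sketch (the path is just a placeholder, and InsertsSec refers to a column of your own frame): writing with the default comma separator and a quoted path usually lets Excel split the columns correctly, and sort_values takes care of the row order before writing:
# sort by one column, then write a comma-separated file Excel can parse
df.sort_values('InsertsSec', ascending=False).to_csv(
    '/home/summary.csv', encoding='utf-8-sig')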
This is how I ended up doing it:
df.to_excel(vExcelFile, 'All')
vSortedDF = df.sort_values(['Deletes%'], ascending=False)
vSortedDF.loc[vSortedDF['Deletes%'] > 5, ['DeletesTotal', 'DeletesSec', 'Deletes%']].to_excel(vExcelFile, 'Top Delete objects')
vExcelFile.save()
For CSV, instead of the \t separator I used , and it worked just fine:
df.to_csv('/home/summary.log', sep=',', encoding='utf-8-sig')

pandas dataframe: duplicates based on column and time range

I have a (very simplified here) pandas dataframe which looks like this:
df
datetime user type msg
0 2012-11-11 15:41:08 u1 txt hello world
1 2012-11-11 15:41:11 u2 txt hello world
2 2012-11-21 17:00:08 u3 txt hello world
3 2012-11-22 18:08:35 u4 txt hello you
4 2012-11-22 18:08:37 u5 txt hello you
What I would like to do now is to get all the duplicate messages which have their timestamp within 3 seconds. The desired output would be:
datetime user type msg
0 2012-11-11 15:41:08 u1 txt hello world
1 2012-11-11 15:41:11 u2 txt hello world
3 2012-11-22 18:08:35 u4 txt hello you
4 2012-11-22 18:08:37 u5 txt hello you
without the third row, as its text is the same as in rows one and two, but its timestamp is not within the range of 3 seconds.
I tried to define the columns datetime and msg as parameters for the duplicated() method, but it returns an empty dataframe because the timestamps are not identical:
mask = df.duplicated(subset=['datetime', 'msg'], keep=False)
print(df[mask])
Empty DataFrame
Columns: [datetime, user, type, msg, MD5]
Index: []
Is there a way to define a range for my "datetime" parameter? To illustrate, something like:
mask = df.duplicated(subset=['datetime_between_3_seconds', 'msg'], keep=False)
Any help here would as always be very much appreciated.
This piece of code gives the expected output:
df[(df.groupby(["msg"], as_index=False)["datetime"].diff().fillna(0).dt.seconds <= 3).reset_index(drop=True)]
I have grouped on the "msg" column of the dataframe, then selected its "datetime" column and used the built-in diff function, which finds the difference between consecutive values of that column. I filled the resulting NaT values with zero and selected only those rows whose difference is at most 3 seconds.
Before using the above code, make sure that your dataframe is sorted on datetime in ascending order.
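For completeness, a minimal sketch of that preparation step (assuming the datetime column is still stored as plain text):
import pandas as pd

df['datetime'] = pd.to_datetime(df['datetime'])          # make sure the column holds real timestamps
df = df.sort_values('datetime').reset_index(drop=True)   # ascending order, clean index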
This bit of code works on your example data, although you might have to play around with any extreme cases.
From your question I'm assuming you want to filter out messages from the first time it appears in df. It won't work if you have instances where you want to keep the string if it appears again after another threshold.
In short I wrote a function that will take your dataframe and the 'msg' to filter for. It takes the timestamp of the first time the message appears and compares that to all the other times it appears.
It then selects only the instances where it appears within 3 seconds of the first appearance.
import numpy as np
import pandas as pd

# function which returns, for one message, the rows that fall within
# three seconds of that message's first appearance
def get_info_within_3seconds(df, msg):
    df_of_msg = df[df['msg'] == msg].sort_values(by='datetime')
    t1 = df_of_msg['datetime'].reset_index(drop=True)[0]
    datetime_deltas = [(i - t1).total_seconds() for i in df_of_msg['datetime']]
    filter_list = [i <= 3.0 for i in datetime_deltas]
    return df_of_msg[filter_list]
msgs = df['msg'].unique()
#apply function to each unique message and then create a new df
new_df = pd.concat([get_info_within_3seconds(df, i) for i in msgs])

Pandas: Splitting and editing file based on dictionary

I'm new to pandas and having a little trouble solving the following problem.
I have two files I need to use to create output. The first file contains a list of functions and associated genes.
An example of the file (with obviously completely made-up data):
File 1:
Function  Genes
Emotions  HAPPY,SAD,GOOFY,SILLY
Walking   LEG,MUSCLE,TENDON,BLOOD
Singing   VOCAL,NECK,BLOOD,HAPPY
I'm reading it into a dictionary using:
from collections import defaultdict

FunctionsWithGenes = defaultdict(list)

def read_functions_file(File):
    Header = File.readline()
    Lines = File.readlines()
    for Line in Lines:
        Function, Genes = Line.split()  # e.g. "Emotions" and "HAPPY,SAD,GOOFY,SILLY"
        FunctionsWithGenes[Function] = Genes.split(",")  # the genes for each function are in the same row, separated by commas
The second file is a .txt table that contains all the information I need, including a column of genes, for example:
chr start end Gene Value MoreData
chr1 123 123 HAPPY 41.1 3.4
chr1 342 355 SAD 34.2 9.0
chr1 462 470 LEG 20.0 2.7
that I read in using:
import pandas as pd
df = pd.read_table(File)
The dataframe contains multiple columns, one of which is "Gene". This column can contain a variable number of entries. I would like to split the dataframe by the "Function" key in the FunctionsWithGenes dictionary. So far I have:
df = df[df["Gene"].isin(FunctionsWithGenes.keys())] # to remove all rows with no matching entries
Now I need to somehow split the dataframe based on gene functions. I was thinking perhaps to add a new column with gene function, but not sure if that would work since some genes can have more than one function.
I'm a little confused by your last line of code:
df = df[df["Gene"].isin(FunctionsWithGenes.keys())]
since the keys of FunctionsWithGenes are the actual functions (Emotions etc.), while the Gene column holds the gene names. The resulting DataFrame would therefore always be empty.
If I understand you correctly, you would like to split the table up so that all the genes belonging to a function are in one table. If that's the case, you could use a simple dictionary comprehension. I set up some variables similar to yours:
>>> for function, genes in FunctionsWithGenes.items():
...     print(function, genes)
...
Walking ['LEG', 'MUSCLE', 'TENDON', 'BLOOD']
Singing ['VOCAL', 'NECK', 'BLOOD', 'HAPPY']
Emotions ['HAPPY', 'SAD', 'GOOFY', 'SILLY']
>>> df
Gene Value
0 HAPPY 3.40
1 SAD 4.30
2 LEG 5.55
Then I split up the DataFrame like this:
>>> FunctionsWithDf = {function: df[df['Gene'].isin(genes)]
...                    for function, genes in FunctionsWithGenes.items()}
Now FunctionsWithDf is a dictionary which maps each Function to a DataFrame containing all the rows whose Gene column value is in FunctionsWithGenes[Function].
For example:
>>> FunctionsWithDf['Emotions']
Gene Value
0 HAPPY 3.4
1 SAD 4.3
>>> FunctionsWithDf['Singing']
Gene Value
0 HAPPY 3.4
