I have a text file with data which looks like this:
NCP_341_1834_0022.png 2 0 130 512 429
I would like to split the data into different columns with names like this:
['filename','class','xmin','ymin','xmax','ymax']
I have done this:
test_txt = pd.read_csv(r"../input/covidxct/train_COVIDx_CT-3A.txt")
test_txt.to_csv(r"../working/test/train.csv",index=None, sep='\t')
train = pd.read_csv("../working/test/train.csv")
However when I download the .csv file, it gives me the data line all in one column, as opposed to 6 columns. How can I fix this?
Just set the right separator (',' by default):
test_txt = pd.read_csv(r"../input/covidxct/train_COVIDx_CT-3A.txt", sep=' ', header=None)
if you are using train_COVIDx_CT-3A.txt from Kaggle.
Don't forget to set header=None, since there is no header row. You can also pass names=['image', 'col1', 'col2', ...] to replace the default column names (0, 1, 2, ...).
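For example, a minimal sketch putting both together (the column names are the ones listed in the question):
import pandas as pd

train = pd.read_csv("../input/covidxct/train_COVIDx_CT-3A.txt", sep=' ', header=None,
                    names=['filename', 'class', 'xmin', 'ymin', 'xmax', 'ymax'])
print(train.head())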
Just to answer my own question: you can use .str.split to split the single column into several. For me, I split it into 6 columns, for my 6 labels:
train[['filename', 'class', 'xmin', 'ymin', 'xmax', 'ymax']] = train['NCP_96_1328_0032.png 2 9 94 512 405'].str.split(' ', n=5, expand=True)
train.head()
Then just drop the column you don't need (drop returns a new frame, so assign it back):
train = train.drop(train.columns[0], axis=1)
I have a dataset containing multiple columns, where each column contains tuple data. I want to save this data so that I can recall it in different Python notebooks without having to rerun everything. Any format is okay (csv, JSON, etc.).
id  pos_tag                                                                             clean data
1   [(sangat, RB), (baik, JJ)]                                                          [baik]
2   [(sangat, RB), (membantu, VB)]                                                      [membantu]
3   [(kenapa, WH), (kok, NN), (perbaikan, NN), (sistem, NN), (bayar, VB), (bisa, MD)]   [perbaikan, sistem, bayar]
This is what I found so far...
df.to_pickle('test.pkl')
new_df = pd.read_pickle('test.pkl')
I will also need help on how to recall the data.
Just in case you want to write the data to csv and retrieve it back, I have an example for you:
# Sample dataframe written to csv file and read back
d = {'id': [1, 2], 'pos_tag': [(1, 2), ("a", "b", "c")]}
df1 = pd.DataFrame(d)
df1.to_csv("/content/df1.csv", sep="|", index=False)
df2 = pd.read_csv("/content/df1.csv", sep="|")
df2.head()
I save it to csv with a pipe delimiter, just to make sure it is different from characters that appear inside the values, like commas or semicolons.
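The file then looks like this on disk, with the tuples stored as their string representations:
id|pos_tag
1|(1, 2)
2|('a', 'b', 'c')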
# Retrieve tuple values
df2['pos_tag'] = [eval(s) for s in df2['pos_tag']]
df2.head()
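Note that eval executes arbitrary code, so if the csv could come from an untrusted source, ast.literal_eval from the standard library is a safer drop-in here; it only parses Python literals such as tuples:
import ast

# parse the stored tuple strings back into tuples, rejecting anything that isn't a literal
df2['pos_tag'] = [ast.literal_eval(s) for s in df2['pos_tag']]
df2.head()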
While reading txt files into a data frame, I want to add a new column based on the values in the existing columns, i.e. adding the numeric values from 'Stock' and 'Delivery'.
The problem is, the original data (from the data supplier) was generated with "df.to_markdown()".
I can't seem to remove the whitespace.
ds = pd.read_csv("C:\\TEMP\\ff.txt", sep="|", header=0, skipinitialspace=True)
ds.columns = ds.columns.str.strip()
ds['new'] = ds['Stock'] + ds['Delivery']
print(ds)
What would be the way to handle such case? Thank you.
By the way, this simulates the txt file creation from "df.to_markdown()":
import pandas as pd

data = {'Price': [59, 98, 79],
        'Stock': [53, 60, 60],
        'Delivery': [11, 7, 6]}
df = pd.DataFrame(data)

with open("C:\\TEMP\\ff.txt", 'a') as outfile:
    outfile.write(df.to_markdown() + "\n")
# the with block closes the file automatically; no explicit close() is needed
This should do what you need.
ds = pd.read_csv(
    "C:\\TEMP\\ff.txt",
    sep="|",
    skiprows=[1],
    skipinitialspace=True
)
ds.columns = ds.columns.str.strip()
ds = ds.loc[:, ["Price", "Stock", "Delivery"]]
ds['new'] = ds['Stock'] + ds['Delivery']
print(ds)
Output:
Price Stock Delivery new
0 59 53 11 64
1 98 60 7 67
2 79 60 6 66
skiprows=[1] skips the row at index 1, which is the markdown separator row (the one made of -------- and :).
With this row removed from the dataframe, pandas automatically interprets the Price, Stock, and Delivery columns as integers, which allows the statement ds['new'] = ds['Stock'] + ds['Delivery'] to work as expected.
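A quick way to confirm the parse went as expected (a sketch using the ds built above):
# all three columns should now be numeric, so the addition works
print(ds.dtypes)
# Price       int64
# Stock       int64
# Delivery    int64
# new         int64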
This works on the example you have provided:
pd.read_csv("~/Downloads/ff.txt", sep=r"\s*\|\s*", engine="python", skiprows=[1])[["Price", "Stock", "Delivery"]]
If you want something else I suggest you provide an example for it.
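If you go this route, the new column from the question can be added in the same way (a sketch reusing the one-liner above):
ds = pd.read_csv("~/Downloads/ff.txt", sep=r"\s*\|\s*", engine="python", skiprows=[1])[["Price", "Stock", "Delivery"]]
ds["new"] = ds["Stock"] + ds["Delivery"]
print(ds)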
I have data that I want to retrieve from a couple of text files in a folder. For each file in the folder, I create a pandas.DataFrame to store the data. So far it works correctly, and all the files have the same number of rows.
Now what I want to do is add each of these dataframes to a 'master' dataframe containing all of them. I would like to add each dataframe to the master dataframe with its file name.
I already have the file name.
For example, let's say I have 2 dataframes with their own file names; I want to add them to the master dataframe with a header for each of these 2 dataframes representing the name of the file.
What I have tried now is the following:
# T0 data
t0_path = "C:/Users/AlexandreOuimet/Box Sync/Analyse Opto/Crunch/GF data crunch/T0/*.txt"
t0_folder = glob.glob(t0_path)
t0_data = pd.DataFrame()

for file in t0_folder:
    raw_data = parseGFfile(file)
    file_data = pd.DataFrame(raw_data, columns=['wavelength', 'max', 'min'])
    file_name = getFileName(file)
    t0_data.insert(loc=len(t0_data.columns), column=file_name, value=file_data)
Could someone help me with this please?
Thank you :)
Edit:
I think I was not clear enough; this is what I am expecting as an output: [expected output screenshot omitted]
You may be looking for the concat function. Here's an example:
import pandas as pd
A = pd.DataFrame({'Col1': [1, 2, 3], 'Col2': [4, 5, 6]})
B = pd.DataFrame({'Col1': [7, 8, 9], 'Col2': [10, 11, 12]})
a_filename = 'a_filename.txt'
b_filename = 'b_filename.txt'
A['filename'] = a_filename
B['filename'] = b_filename
C = pd.concat((A, B), ignore_index = True)
print(C)
Output:
Col1 Col2 filename
0 1 4 a_filename.txt
1 2 5 a_filename.txt
2 3 6 a_filename.txt
3 7 10 b_filename.txt
4 8 11 b_filename.txt
5 9 12 b_filename.txt
There are a couple changes to make here in order to make this happen in an easy way. I'll list the changes and reasoning below:
Specified which columns your master DataFrame will have
Instead of using some function that it seems like you were trying to define, you can simply create a new column called "file_name" that will be the filepath used to make the DataFrame for every record in that DataFrame. That way, when you combine the DataFrames, each record's origin is clear. I commented that you can make edits to that particular portion if you want to use string methods to clean up the filenames.
At the end, don't use insert. For combining DataFrames with the same columns (a union operation if you're familiar with SQL or with set theory), you can use the append method.
# T0 data
t0_path = "C:/Users/AlexandreOuimet/Box Sync/Analyse Opto/Crunch/GF data crunch/T0/*.txt"
t0_folder = glob.glob(t0_path)
t0_data = pd.DataFrame(columns=['wavelength', 'max', 'min', 'file_name'])

for file in t0_folder:
    raw_data = parseGFfile(file)
    file_data = pd.DataFrame(raw_data, columns=['wavelength', 'max', 'min'])
    file_data['file_name'] = file  # You can make edits here
    t0_data = t0_data.append(file_data, ignore_index=True)
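Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. On current versions, the usual pattern is to collect the per-file frames in a list and call pd.concat once at the end; a sketch of the same loop (parseGFfile and t0_folder as defined above):
frames = []
for file in t0_folder:
    raw_data = parseGFfile(file)
    file_data = pd.DataFrame(raw_data, columns=['wavelength', 'max', 'min'])
    file_data['file_name'] = file  # You can make edits here
    frames.append(file_data)
t0_data = pd.concat(frames, ignore_index=True)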
I have a CSV file with the following columns of interest
fields = ['column_0', 'column_1', 'column_2', 'column_3', 'column_4', 'column_5', 'column_6', 'column_7', 'column_8', 'column_9']
for each of these columns, there are 153 lines of data, containing only two values: -1 or +1
My problem is that, for each column, I would like to save the frequencies of the -1 and +1 values, comma-separated, line by line in a CSV file. I run into the following problems when I do this:
>>> df = pd.read_csv('data.csv', skipinitialspace=True, usecols=fields)
>>> print(df['column_2'].value_counts())
 1    148
-1      5
>>> df['column_2'].value_counts().to_csv('result.txt', index=False)
Then, when I open result.txt, here is what I find:
148
5
Which is obviously not what I want; I want the values on the same line of the text file, separated by a comma (e.g., 148,5).
The second problem happens when one of the frequencies is zero:
>>> print(df['column_9'].value_counts())
1    153
>>> df['column_9'].value_counts().to_csv('result.txt', index=False)
Then, when I open result.txt, here is what I find:
153
I also don't want that behavior; I would like to see 153,0.
So, in summary, I would like to know how to do the following with pandas:
Given one column, save its different values frequencies in the same line of a csv file and separated by commas. For example:
148,5
If there is a value with frequency 0, put that in the CSV. For example:
153,0
Append these frequency values in different lines of the same CSV file. For example:
148,5
153,0
Can I do that with pandas, or should I move to another Python library?
Example with some dummy data:
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 1, -1, -1, -1],
                   'col2': [1, 1, 1, 1, 1, 1],
                   'col3': [-1, 1, -1, 1, -1, -1]})

counts = df.apply(pd.Series.value_counts).fillna(0).T
print(counts)
Output:
-1 1
col1 3.0 3.0
col2 0.0 6.0
col3 4.0 2.0
You can then export this to csv.
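For example, a minimal sketch (the file name frequencies.csv is just a placeholder) that writes one "count of 1,count of -1" line per column, matching the 148,5 format asked for:
# reorder the columns to (1, -1) and write one comma-separated line per original column
counts[[1, -1]].astype(int).to_csv('frequencies.csv', header=False, index=False)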
See this answer for ref:
How to get value counts for multiple columns at once in Pandas DataFrame?
I believe you could do what you want like this:
import io
import pandas as pd

df = pd.DataFrame({'column_1': [1, -1, 1], 'column_2': [1, 1, 1]})

with io.StringIO() as stream:
    # it's easier to transpose a dataframe so that the number of rows become columns
    # .to_frame to DataFrame and .T to transpose
    df['column_1'].value_counts().to_frame().T.to_csv(stream, index=False)
    print(stream.getvalue())  # check the csv data
But I would suggest something like this instead, since with the version above you would otherwise have to handle the case where one of the expected values is missing:
with io.StringIO() as stream:
    # count the values per column, then fill the missing counts with 0
    counts = df[['column_1', 'column_2']].apply(lambda column: column.value_counts())
    counts = counts.fillna(0)
    counts.T.to_csv(stream, index=False)
    print(stream.getvalue())  # check the csv data
Here is an example with three columns c1, c2, c3 and a data frame d, which is defined before the function is invoked.
import pandas as pd
import collections

def wcsv(d):
    # count the occurrences of each value in every column
    dc = [dict(collections.Counter(d[i])) for i in d.columns]
    # make sure both expected values are present, with frequency 0 if missing
    for i in dc:
        if -1 not in i:
            i[-1] = 0
        if 1 not in i:
            i[1] = 0
    # look the counts up by key so the column order is always (1, -1)
    w = pd.DataFrame([[j[1], j[-1]] for j in dc],
                     columns=['1', '-1'], index=d.columns)
    w.to_csv("t.csv")

d = pd.DataFrame([[1, 1, -1], [-1, 1, 1], [1, 1, -1], [1, 1, -1]],
                 columns=['c1', 'c2', 'c3'])
wcsv(d)
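With the sample frame above, t.csv then contains one line per column, each holding the frequencies of 1 and -1:
,1,-1
c1,3,1
c2,4,0
c3,1,3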
I am using pandas to read in a csv column where every row has the following format:
IP: XXX:XX:XX:XXX
To get rid of the IP: prefix, I am editing the column after the fact:
logs['ip'] = logs['ip'].str[4:]
Is there a way to perform this operation within read_csv, maybe with regex, to avoid the post-computation?
Update:
Consider this scenario where there are multiple columns that have these prefixes. Is there a better way?
logs['mac'] = logs['mac'].str[5:]
logs['id'] = logs['id'].str[4:]
logs['lan'] = logs['lan'].str[5:]
logs['ip'] = logs['ip'].str[4:]
The converters option for read_csv might provide a useful way. Let's say the file looks like this:
id address
1 IP:123.1.1.1
2 IP:456.1.1.1
3 IP:789.1.1.1
Then you could specify that 'IP:' should be converted to '' (blank) like this:
dct = {'address': lambda x: x.replace('IP:', '')}
df = pd.read_csv('foo.txt', sep=r'\s+', converters=dct)
id address
0 1 123.1.1.1
1 2 456.1.1.1
2 3 789.1.1.1
I'm ignoring the slight complication that if there is a space after IP: then you might be reading IP: in as its own column, but you ought to be able to adapt this fairly easily to handle that.
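For the multiple-column case in the update, the same idea extends to one converters entry per column. A sketch, assuming the prefixes are 'MAC: ', 'ID: ', 'LAN: ' and 'IP: ' (guessed from the slice lengths in the question) and a hypothetical logs.csv:
import pandas as pd

# one converter per prefixed column; each strips its prefix once, from the left
dct = {
    'mac': lambda x: x.replace('MAC: ', '', 1),
    'id': lambda x: x.replace('ID: ', '', 1),
    'lan': lambda x: x.replace('LAN: ', '', 1),
    'ip': lambda x: x.replace('IP: ', '', 1),
}
logs = pd.read_csv('logs.csv', converters=dct)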
You could just treat the column values as strings and use .split("IP: ")[1] to keep everything after the "IP: " prefix. I'm not sure if this is the best approach, but it's what came to mind:
logs['ip'] = logs['ip'].str.split('IP: ').str[1]