Is there a way to edit columns in a CSV file with Python?

I'm trying to standardize data in a large CSV file. I want to replace the string "Greek" with a different string "Q35497", but only in a single column (I don't want to replace every instance of the word "Greek" with "Q35497" in every column, just in the column named "P407"). This is what I have so far:
data_frame = pd.read_csv('/data.csv')
data_frame["P407"] = data_frame['P407'].astype(str)
data_frame["P407"].str.replace('Greek', 'Q35497')
But what this does is just create a single "P407" column containing a list of strings (such as 'Q35497'), and I can't append it back to the whole CSV table.
I tried using DataFrame.replace:
data_frame = data_frame.replace(
    to_replace={"P407": {'Greek': 'Q35497'}},
    inplace=True
)
But this just creates an empty set. I also can't figure out why data_frame["P407"] creates a separate series that cannot be added to the original csv file.

Your approach is correct, but you are missing the step of storing the modified column back into the dataframe.
data_frame = pd.read_csv('/data.csv')
data_frame["P407"] = data_frame["P407"].str.replace('Greek', 'Q35497')

Related

Python: extract position-dependent strings from .txt and save them to different columns of a dataframe

I have a .txt file (output.txt) from which I want to use specific strings. The required strings start at position 13 and go to the end of a line. I would like to save them to different columns of a dataframe.
I created an empty dataframe with 4 columns:
cameras = pd.DataFrame(columns=['name', 'altitude', 'latitude', 'longitude'])
and I have tried to assign the strings to different columns
with open('output.txt', 'r') as f:
    for line in f.readlines():
        if line.startswith('name'):
            cameras['name'] = line[13:-1]
        if line.startswith('NN'):
            cameras['altitude'] = line[13:-1]
        if line.startswith('lat'):
            cameras['latitude'] = line[13:-1]
        if line.startswith('lon'):
            cameras['longitude'] = line[13:-1]
But apparently the dataframe is still empty. I guess it's an easy problem to fix.
Thanks in advance!
You can create the data as an array of tuples of the form (<name>, <altitude>, <latitude>, <longitude>).
Then use pd.DataFrame.from_records() to create the dataframe (see the sketch below).
There are several pitfalls here that you should be aware of. The assumption is that the input data has rows in the order 'name', 'altitude', 'latitude', 'longitude'. If the order breaks (due to a missing row or incorrect ordering), you'll run into data inconsistencies. Do strict data validation.
Please refer to https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.from_records.html
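A minimal sketch of that approach, assuming output.txt repeats blocks of four lines starting with name, NN, lat and lon (in that order) and that the value always begins at character 13:

import pandas as pd

records = []
with open('output.txt', 'r') as f:
    block = {}
    for line in f:
        value = line[13:].rstrip('\n')
        if line.startswith('name'):
            block['name'] = value
        elif line.startswith('NN'):
            block['altitude'] = value
        elif line.startswith('lat'):
            block['latitude'] = value
        elif line.startswith('lon'):
            block['longitude'] = value
            # the longitude line closes a block, so emit one tuple per camera
            records.append((block['name'], block['altitude'],
                            block['latitude'], block['longitude']))
            block = {}

cameras = pd.DataFrame.from_records(
    records, columns=['name', 'altitude', 'latitude', 'longitude'])

If a block is incomplete, the append raises a KeyError, which is one simple form of the strict validation mentioned above.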

Compare unique identifiers, if true, do action in python Pandas

I have 2 files that partly contain the same items. To detect them, there is a unique identifier (UID).
What I am trying to achieve is to compare the UIDs in the first file to the UIDs in the second file. If they are identical, another column in the first file should be filled with the content of the corresponding column in the second file.
import pandas as pd
dfFile2 = pd.read_csv("File2.csv", sep=";")
dfFile1 = pd.read_csv("File1.csv", sep=";")
UIDURLS = dfFile2["UID"]
UIDKonf = dfFile1["UID"]
URLSurl = dfFile2["URL"]
URLSKonf = dfFile1["URL"]
for i in range(0, len(UIDKonf)):
    for j in range(0, len(UIDURLS)):
        if UIDKonf.at[i] == UIDURLS.at[j]:
            URLSKonf.at[i] = URLSurl[j]
The code above does not give me any errors, but I also want it to directly write into the original .csv and not into the Dataframe. How could I achieve that?
Best
If you create the DataFrame with the updated information you want, you can write it back to a CSV in pandas using DataFrame.to_csv; see the sketch below.
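A minimal sketch of the whole round trip, keeping the semicolon separator from the question and swapping the nested loop for a Series.map lookup, which does the same UID matching in one vectorized step:

import pandas as pd

dfFile2 = pd.read_csv("File2.csv", sep=";")
dfFile1 = pd.read_csv("File1.csv", sep=";")

# Build a UID -> URL lookup from the second file and fill File1's URL column
# wherever the UIDs match; rows without a match keep their original value.
uid_to_url = dfFile2.set_index("UID")["URL"]
dfFile1["URL"] = dfFile1["UID"].map(uid_to_url).fillna(dfFile1["URL"])

# Overwrite the original file with the updated dataframe.
dfFile1.to_csv("File1.csv", sep=";", index=False)

This assumes the UIDs in File2.csv are unique, as the question states; with duplicate UIDs, map would raise an error.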

How to export a dictionary to excel using Pandas

I am trying to export some data from Python to Excel using pandas, and not succeeding. The data is a dictionary where the keys are tuples of 4 elements.
I am currently using the following code:
df = pd.DataFrame(data)
df.to_excel("*file location*", index=False)
and I get an exported 2-column table as follows:
I am trying to get an excel table where the first 3 elements of the key are split into their own columns, and the 4th element of the key (Period in this case) becomes a column name, similar to the example below:
I have tried using different additions to the above code, but I'm a bit new to this, and so nothing is working so far.
Based on what you show us (which is not reproducible), you need pandas.MultiIndex:
df_ = df.set_index(0)  # `0` since your tuples seem to be located at the first column
df_.index = pd.MultiIndex.from_tuples(df_.index)  # convert the simple index into an N-dimensional index
# `~.unstack` does the job of locating your periods as columns
df_.unstack(level=-1).droplevel(0, axis=1).to_excel(
    "file location", index=True
)
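A self-contained sketch of the same idea, built from a hypothetical dictionary with 4-element tuple keys (the real keys and values are not shown in the question):

import pandas as pd

# Hypothetical data: keys are (region, product, year, period), values are amounts.
data = {
    ("EU", "A", 2021, "Q1"): 10,
    ("EU", "A", 2021, "Q2"): 12,
    ("US", "B", 2021, "Q1"): 7,
    ("US", "B", 2021, "Q2"): 9,
}

s = pd.Series(data)
s.index = pd.MultiIndex.from_tuples(
    s.index, names=["region", "product", "year", "period"])

# Move the last key element ("period") into the columns, then export.
table = s.unstack(level=-1)
table.to_excel("file location", index=True)  # "file location" is a placeholder path

The resulting sheet has region, product and year as index columns and one column per period, which matches the layout described above.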
You could try exporting to a CSV instead:
df.to_csv(r'Path where you want to store the exported CSV file\File Name.csv', index=False)
which can then be converted to an Excel file easily.

How to write a pandas dataframe into a CSV file?

I don't get how to do this. I'm trying the following:
OutputData_tmp = pd.DataFrame(columns=('frame_len', 'frame_transport_protocol', 'ip_len', 'ip_ttl', 'ip_src', 'ip_dst', 'src_port', 'dst_port', 'payload_len', 'data_len'))
to create an empty dataframe, and then, inside a for loop I do:
OutputData_tmp.loc(line)
with 'line' being a list of float values.
Then:
OutputData_tmp.to_csv('TrainingSet\\TrainingFeatures.csv')
to save the dataframe as csv.
But when I open TrainingFeatures.csv it is empty; it only has the header (column names).
What???
You are adding rows to the dataframe the wrong way.
Refer to the following link for adding a row:
add one row in a pandas.DataFrame
Rather than doing OutputData_tmp.loc(line), use label-based assignment with square brackets:
OutputData_tmp.loc[i] = line
Hope this helps.
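A minimal runnable sketch of that fix, with a shortened column list and made-up float rows standing in for the values computed in the original loop:

import pandas as pd

OutputData_tmp = pd.DataFrame(columns=('frame_len', 'ip_len', 'ip_ttl'))

# Hypothetical rows; in the original code each 'line' comes from the for loop.
rows = [[60.0, 40.0, 64.0],
        [1500.0, 1480.0, 128.0]]

for i, line in enumerate(rows):
    OutputData_tmp.loc[i] = line  # square brackets: label-based row assignment

OutputData_tmp.to_csv('TrainingFeatures.csv', index=False)

After this, TrainingFeatures.csv contains the header plus one line per appended row.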

How to get column names from a large file by using python dataframes

Hi, I have a very large TSV file, about 1 GB. I just want to create an array that contains the column names. This is what I've done so far:
import pandas as pd
x = pd.read_csv('mytsvfile.tsv', nrows=1).columns
Unfortunately, this gives me
>>> type(x)
<class 'pandas.core.indexes.base.Index'>
and when I convert it to a list, the length of the list is 1, which is not equal to the number of columns in the TSV file.
I think you need to add the separator (\t if tab-delimited); nrows=0 also works:
x = pd.read_csv('mytsvfile.tsv', nrows=0, sep='\t').columns.tolist()
If you only need the column names, and the column names are in the first line, and you need them as a Python list, why bring pandas into the mix at all? Just use readline() like so:
with open('mytsvfile.tsv', 'r') as tsv:
    # strip the trailing newline so the last column name comes out clean
    columns = tsv.readline().rstrip('\n').split('\t')
Sorry about the tab character, I'm on mobile.
What you are looking for can be obtained without any intermediate step:
list_of_column_names = list(x)
More generally:
list(df.columns)
You can also determine the number of columns of your dataframe df:
len(df.columns)  # or df.shape[1]
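Putting the pieces together, a quick sanity check that the tab separator fixes the count, assuming the header sits in the first line of the file:

import pandas as pd

column_names = pd.read_csv('mytsvfile.tsv', nrows=0, sep='\t').columns.tolist()
print(len(column_names))   # number of columns
print(column_names[:5])    # first few names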
