This is probably a basic question, so my apologies if it has been asked before. I have searched extensively and have not been able to find an answer.
I am reading records in protobuf format, and trying to come up with a script that will write to csv. The proto file has lots of optional messages, followed by a value. I want to be able to write the value to a corresponding column.
E.g. the columns:
A, B, C, D, E, F, G, H
The proto messages will be a stream of random values matched to a column heading.
i.e. (A,1), (B,4), (H,2), (F,3)
(much more complicated, but this will do for an example). When I receive a message, I want to be able to locate the correct column and place the value directly into it.
Note: I am writing this for others to use, so I would prefer not to use pandas, for simplicity's sake. There are hundreds of columns, so is there any way to place the value directly into a column without searching through all the columns each time, using == to find the corresponding one? I.e. something along the lines of:
write value 3 to column F
You could turn your CSV into a dict; see the question "Creating a dictionary from a csv file?"
Then replace your dict values based on the column name / key.
Then rewrite the whole thing at once. If you need to preserve order just keep the original header row.
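The dict idea above can be sketched with the standard library's csv module, without pandas. This is only a minimal illustration: the column list, the sample messages and the variable names are assumptions taken from the example in the question, not from a real proto schema. A dict lookup by column name avoids comparing against every column with ==.

```python
import csv
import io

# Column headings from the example in the question (assumed, not a real schema).
columns = ["A", "B", "C", "D", "E", "F", "G", "H"]

# Hypothetical decoded messages: (column name, value) pairs in arbitrary order.
messages = [("A", 1), ("B", 4), ("H", 2), ("F", 3)]

# Start with one empty cell per column, then assign by key.
row = dict.fromkeys(columns, "")
for name, value in messages:
    row[name] = value  # direct lookup by column name, no scan over the columns

# Write to an in-memory buffer here; a real script would open a file instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns)
writer.writeheader()
writer.writerow(row)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing fieldnames to DictWriter is what preserves the original header order, whatever order the messages arrive in.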
I'm currently working with a database table that has an ID as primary key which can have up to 28 digits.
For my use case I need to manipulate some data points in this table (including the ID) and write it back to the db table.
Now, for the ID I need to increment it by one, and I'm struggling to achieve this with pandas on Windows.
Unfortunately and obviously, I cannot read and save the ID as plain integers in the dataframe.
Converting it to np.float64 beforehand seems to be completely messing up the values.
For example:
I'm manipulating the data point with ID 2021051800100770010113340000
If I convert the ID column to np.float64 by explicitly providing the dtype of this column,
the ID becomes 2021051800100769903675441152.0, which seems to be a completely different number to me.
Also, I don't know if incrementing the ID column by 1 is working, since the result is the same as the number above.
Is there a proper way to do this? My last option would be to convert it to a string and then change the last part of that string, but I don't feel this would be a good and clean solution. Not to mention that I'm not sure I can write it back to the db in that form.
edit//
Based on this suggestion (https://stackoverflow.com/a/21591439/3856569)
I edited the ID column the following way:
df["ID"] = df["ID"].apply(int)
and then incrementing the number.
I get the following result:
2021051800100769903675441152
2021051800100769903675441153
So the increment seems to work now, but the numbers are still completely different from the ID I started with.
Please bear with me while I look at this problem from another angle. If we can understand how the ID is formed, we may be able to handle it differently. For example, the first 8 digits look like a date, and if that is true, then none of your manipulation should modify those 8 digits unless your intention is to change the date. In this case, you can separate your ID (as a str) into 2 parts.
20210518 / 00100770010113340000
Then now we only need to handle the second part which is still too large for np.int64. However, if you find out how it is formed, then perhaps you can further separate it and finally handle a number that can be handled by np.int64.
For example, would the ID be formed in this way?
20210518 / 001 / 007 / 7001011334 / 0000
If we can split it into segments of meaning, then we know which part we need to keep when manipulating (adding 1 in your case)
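As a complement to the splitting idea, here is a minimal sketch of another route, under the assumption that the ID can be read as text: Python's built-in int has arbitrary precision, so the increment can be done on a plain int and written back as a string. The two-row CSV and the helper name increment_id are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical two-row extract; the first ID is the one from the question.
csv_text = (
    "ID,value\n"
    "2021051800100770010113340000,a\n"
    "2021051800100770010113340009,b\n"
)

# Read the ID column as text so pandas never converts it to float64.
df = pd.read_csv(io.StringIO(csv_text), dtype={"ID": str})


def increment_id(id_text: str) -> str:
    """Add 1 using Python's arbitrary-precision int, keeping the original width."""
    return str(int(id_text) + 1).zfill(len(id_text))


df["ID"] = df["ID"].map(increment_id)
print(df)

# For comparison: a round trip through float does not give the original ID back.
original = 2021051800100770010113340000
survives_float = int(float(original)) == original
```

Whether the database driver accepts the ID as a string or as a Python int depends on the column type and the driver, which the question does not specify.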
I have a CSV file that includes one column of data that is not user friendly. I need to translate that data into something that makes sense. Simple find/replace seems bulky, since there are dozens if not hundreds of different possible combinations I want to translate.
For instance: BLK = Black or MNT TP = Mountain Top
There are dozens if not hundreds of translations possible - I have lots of them already in a CSV table. The problem is how to use that dictionary to change the values in another CSV table. It is also important to note that this will (eventually) need to run on its own every few minutes - not just a one time translation.
It would be nice if you could describe in more detail what's the data you're working on. I'll do my best guess though.
Let's say you have a CSV file, you use pandas to read it into a data frame named df, and the "not user friendly" column named col.
To replace all the values in column col, you first need a dictionary containing all the keys (original texts) and values (new texts):
my_dict = {"BLK": "Black", "MNT TP": "Mountain Top"}  # ...plus the remaining translations
Then, map the dictionary to the column:
df["col"] = df["col"].map(lambda x: my_dict.get(x, x))
If a key appears in the dictionary, it will be replaced by the new corresponding value in the dictionary, otherwise, it keeps the original value.
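Since the question mentions that the translations already live in a CSV table, the dictionary could be built from that file instead of being typed by hand. The sketch below assumes a lookup table with two columns named code and meaning, and a data column named col; all of these names and the sample rows are hypothetical.

```python
import io

import pandas as pd

# Hypothetical lookup table; in practice this would be read from the translations CSV.
lookup_csv = "code,meaning\nBLK,Black\nMNT TP,Mountain Top\n"
lookup = pd.read_csv(io.StringIO(lookup_csv))
my_dict = dict(zip(lookup["code"], lookup["meaning"]))

# Hypothetical data file with the "not user friendly" column named col.
data_csv = "item,col\n1,BLK\n2,MNT TP\n3,RED\n"
df = pd.read_csv(io.StringIO(data_csv))

# Translate known codes; unknown ones (RED here) keep their original value.
df["col"] = df["col"].map(lambda x: my_dict.get(x, x))
print(df)
```

Because the dictionary is rebuilt from the lookup file on every run, a job scheduled every few minutes would pick up newly added translations without code changes.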
I am currently trying to read in a csv file for the purpose of creating a budget from a statement, and I want to group similar items, e.g. fuel etc. So I'd like to get the values from column E (aka column 5), store these values in a list, pair them with the cost and then group them into lumps, e.g. fuel. So far, for simply trying to read the column, I have the following
temp=pd.read_csv("statement.csv",usecols=['columnE'])
print(temp)
and the following table:
Values removed for obvious reasons. However when I run this I get the error Usecols do not match columns, why is this? I assumed I would at least get a value even if it's not the right one.
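usecols must contain names that appear in the file's header row (or integer column positions), not spreadsheet column letters. Correct the column name to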
temp=pd.read_csv("statement.csv",usecols=['Transaction Description'])
and try again
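If the exact header text is not known, one way to check it is to read only the header row first. The sketch below uses a made-up statement in which "Transaction Description" is the fifth column; the other header names and the sample row are assumptions for illustration.

```python
import io

import pandas as pd

# Hypothetical statement; only "Transaction Description" comes from the answer above.
csv_text = (
    "Date,Type,Reference,Amount,Transaction Description\n"
    "2021-01-01,DEB,12345,40.00,FUEL STATION\n"
)

# nrows=0 reads just the header, which shows the names usecols will accept.
header = pd.read_csv(io.StringIO(csv_text), nrows=0).columns.tolist()
print(header)

# Select by header name ...
by_name = pd.read_csv(io.StringIO(csv_text), usecols=["Transaction Description"])

# ... or by zero-based position: spreadsheet column E is index 4.
by_position = pd.read_csv(io.StringIO(csv_text), usecols=[4])
```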
I have a batch of identifiers and a pair of values that behave in the following manner within an iteration.
For example,
print(indexIDs[i], (coordinate_x, coordinate_y))
Sample output looks like
I would like to add these data into dataframe, where I can use indexIDs[i] as row and append incoming pair of values with same identifier in the next consecutive columns
I tried the following code, which didn't work.
spatio_location = pd.DataFrame()
spatio_location.loc[indexIDs[i], column_counter] = (coordinate_x, coordinate_y)
Using indexIDs[i] as the row label was a good starting point, but I could not take in new data without overwriting what was already in the dataframe. I am aware this has something to do with the second line, which uses the "=" sign and keeps overwriting the previous result.
I am looking for an appropriate way to change my second line so that new incoming data is inserted into the existing dataframe without overwriting it.
Appreciate your time and effort, thanks.
I'm a bit confused about the nature of coordinate_x (is it a list or what?). Anyway, maybe try appending the rows one at a time.
You could define an empty df with three columns:
df=pd.DataFrame([],columns=['a','b','c'])
Then populate it with a loop over your lists:
for i in range(len(indexIDs)):
    # DataFrame.append was removed in pandas 2.0, so pd.concat is used here
    row = pd.DataFrame([{'a': indexIDs[i], 'b': coordinate_x[i], 'c': coordinate_y[i]}])
    df = pd.concat([df, row], ignore_index=True)
Finally, set column a as the index:
df=df.set_index('a')
Hope it helps.
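For the layout the question describes (one row per identifier, with each new pair placed in the next free columns), a possible sketch is to collect the values in plain dicts first and build the DataFrame once at the end. The observations list and the x_0, y_0, x_1, ... column names are invented for illustration, since the sample output is not shown in the question.

```python
from collections import defaultdict

import pandas as pd

# Hypothetical stream of (identifier, (coordinate_x, coordinate_y)) observations.
observations = [
    ("id_1", (1.0, 2.0)),
    ("id_2", (3.0, 4.0)),
    ("id_1", (5.0, 6.0)),
]

# One dict per identifier; each new pair goes into the next free x_k / y_k slot.
rows = defaultdict(dict)
for identifier, (x, y) in observations:
    k = len(rows[identifier]) // 2  # number of pairs already stored for this id
    rows[identifier][f"x_{k}"] = x
    rows[identifier][f"y_{k}"] = y

# Build the DataFrame once; identifiers with fewer pairs get NaN in the extra columns.
spatio_location = pd.DataFrame.from_dict(rows, orient="index")
print(spatio_location)
```

Building the frame once at the end also avoids growing a DataFrame inside the loop, which copies the data on every iteration.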
I have implemented a program in VBA for excel to generate automatic communications based on user inputs (selections of cells).
The macro, written in VBA, makes extensive use of the VBA ListObject object,
i.e.
defining a table (list object)
Dim ClsSht As Worksheet
Set ClsSht = ThisWorkbook.Sheets("paragraph texts")
Dim ClsTbl As ListObject
Set ClsTbl = ClsSht.ListObjects(1)
accessing the table in the code in a very logical manner:
ClsTbl being now the table where I want to pick up data.
myvariable = ClsTbl.ListColumns("D1").DataBodyRange.Item(34).Value
Which means myvariable is item (row) 34 of the data in column D1 of the table ClsTbl.
I decided to learn Python to "translate" all that code into Python and make a Django-based program accessible to anyone.
I am a beginner in Python and I am wondering what the Python equivalent of VBA's ListObject would be. This decision will shape my whole Python program from the beginning, and I am hesitating a lot over it.
The main idea here getting a way where I can access tables-data in a readable way,
i.e. give me the value of column "text" where column "chapter" is 3 and column paragraph is "2". The values are unique, meaning there is only one value in "text" column where that occurs.
Some observations:
I know everything can be done with lists in python, lists can contain lists that can contain lists..., but this is terrible for readability. mylist1[2][3] (assuming for instance that every row could be a list of values, and the whole table a list of lists of rows).
I don't consider building a database an option. There are multiple relatively small tables (from 10 to 500 rows and from 3 to 15 columns) that are related, but not in a database manner. That would force me to learn yet another language (SQL or similar), and I have more than enough with Python and Django.
The user modifies the structure of many tables (chapters being merged or split).
The data is 100% strings. The only integers are numbers used to sort the text. I don't perform any mathematical operations on the values; I simply join pieces of text together and make replacements in them.
The tables will be loaded into Python as CSV text files.
Please let me know if anything in the question is not clear enough and I will complete it.
Would it be necessary to work with numpy? pandas?
i.e give me the value of cell
A pandas DataFrame should provide everything you need, i.e. conversion to strings, manipulation, import and export. As a start, try
import pandas as pd
df = pd.read_csv('your_file.csv')
print(df)
print(df['text'])
The entries of the first row will be converted to labels of the DataFrame columns.
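The lookup described in the question ("the value of column text where chapter is 3 and paragraph is 2") could then be written with boolean indexing. This is a sketch on a made-up three-row table; the column names follow the question, and dtype=str is an assumption based on the statement that the data is all strings.

```python
import io

import pandas as pd

# Hypothetical table; in practice this would be one of the CSV files.
csv_text = (
    "chapter,paragraph,text\n"
    "1,1,Introduction\n"
    "3,1,Scope\n"
    "3,2,Definitions\n"
)

# dtype=str keeps every column as text, so chapter "3" is compared as a string.
df = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Rows where both conditions hold, restricted to the "text" column.
match = df.loc[(df["chapter"] == "3") & (df["paragraph"] == "2"), "text"]

# .item() returns the single value and raises ValueError if there is not exactly one.
value = match.item()

# Positional access, closest to DataBodyRange.Item(n) in VBA, but zero-based.
by_position = df["text"].iloc[2]
```

Note that VBA's Item(34) is one-based, so the corresponding pandas call would be .iloc[33].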