My python script produces a dictionary as follows:
================================================================
TL;DR
I overcomplicated the problem by using the from_dict method while creating a dataframe from the dictionary. Thanks to @Sword.
In other words, pd.DataFrame.from_dict is only needed if you want to create a dataframe with all the keys in one column and all the values in another column. In all other cases, it is as simple as the approach mentioned in the accepted answer.
==============================================================
{u'19:00': 2, u'12:00': 1, u'06:00': 2, u'00:00': 0, u'23:00': 2, u'05:00': 2, u'11:00': 4, u'14:00': 2, u'04:00': 0, u'09:00': 7, u'03:00': 1, u'18:00': 6, u'01:00': 0, u'21:00': 5, u'15:00': 8, u'22:00': 1, u'08:00': 5, u'16:00': 8, u'02:00': 0, u'13:00': 8, u'20:00': 5, u'07:00': 11, u'17:00': 12, u'10:00': 8}
and it also produces a variable, let's say full_name (taken as an argument to the script) which has the value "John".
Every time I run the script, it gives me a dictionary and a name in the aforementioned format.
I want to write this into a csv file for later analysis in the following format:
FULLNAME | 00:00 | 01:00 | 02:00 | .....| 22:00 | 23:00 |
John | 0 | 0 | 0 | .....| 1 | 2 |
My code to produce that is as follows:
import collections
import pandas as pd
# ........................
# Other part of code, which produces the dictionary by name "data_dict"
# ........................
# Sorting the dictionary (and adding it to an OrderedDict) in order to skip matching dictionary keys with column headers
data_dict_sorted = collections.OrderedDict(sorted(data_dict.items()))
# The first time, to produce column headers, I used .items(); the rest of the lines follow from that.
# df = pd.DataFrame.from_dict(data_dict_sorted.items())
# From the second time onwards, I just need to append the values, so I am using .values()
df = pd.DataFrame.from_dict(data_dict_sorted.values())
df2 = df.T # transposing because from_dict creates all keys in one column, and corresponding values in the next column.
df2.columns = df2.iloc[0]
df3 = df2[1:]
df3["FULLNAME"] = args.name #This is how we add a value, isn't it?
df3.to_csv('test.csv', mode = 'a', sep=str('\t'), encoding='utf-8', index=False)
My code is producing the following CSV:
00:00 | 01:00 | 02:00 | …….. | 22:00 | 23:00 | FULLNAME
0 | 0 | 0 | …….. | 1 | 2 | John
0 | 0 | 0 | …….. | 1 | 2 | FULLNAME
0 | 0 | 0 | …….. | 1 | 2 | FULLNAME
My question is two fold:
Why is it printing "FULLNAME" instead of "John" in the second iteration (i.e. the second time the script is run)? What am I missing?
Is there a better way to do this?
How about this?
df = pd.DataFrame(data_dict, index=[0])
df['FullName'] = 'John'
EDIT:
It is a bit difficult to follow the way you are conducting the operations, but it looks like the issue is with the line df2.columns = df2.iloc[0]. The code above will not need the assignment of column names or the transpose operation. If you are adding a dictionary at each iteration, try:
data_dict['FullName'] = 'John'
df = df.append(pd.DataFrame(data_dict, index=[0]), ignore_index=True).reset_index()
If each row might have a different name, then df['FullName'] = 'John' will set the entire column to John. Hence, as a better step, create a key called 'FullName' in your dict with the appropriate name as its value, so you avoid assigning a uniform value to the entire column, i.e.:
data_dict['FullName'] = 'John'
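Putting the pieces together for the original goal of appending one row per run to a CSV, a minimal sketch (assuming the dictionary is named data_dict and the name comes from args.name, as in the question; the header-toggle logic is my addition) could look like:
import os
import pandas as pd

data_dict['FULLNAME'] = args.name           # e.g. "John"
row = pd.DataFrame(data_dict, index=[0])    # one row, one column per key

# Write the header only if the file does not exist yet, so repeated runs
# simply append new rows under the same columns.
write_header = not os.path.isfile('test.csv')
row.to_csv('test.csv', mode='a', sep='\t', encoding='utf-8',
           index=False, header=write_header)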
I am trying to merge rows with each other to get one row containing all the values that are present. Currently the df looks like this:
(screenshot of the dataframe)
What I want is something like:
| index | scan .. | snel. | kool .. | note .. |
| ----- | ------- | ----- | ------- | ------- |
| 0 | 7,8 | 4,0 | 20.0 | Fiasp, ..|
I can get that output in the code example below but it just seems really messy.
I tried to use groupby, agg, sum, and max, and all those do is remove columns, so it looks like this:
df2.groupby('Tijdstempel apparaat').max().reset_index()
I tried filling the row with the values of the previous rows and then dropping the rows that don't contain every value. But this seems like a long workaround and really messy.
df2 = df2.loc[df['Tijdstempel apparaat'] == '20-01-2023 13:24']
df2 = df2.reset_index()
del df2['index']
df2['Snelwerkende insuline (eenheden)'].fillna(method='pad', inplace=True)
df2['Koolhydraten (gram)'].fillna(method='pad', inplace=True)
df2['Notities'].fillna(method='pad', inplace=True)
df2['Scan Glucose mmol/l'].fillna(method='pad', inplace=True)
print(df2)
# df2.loc[df2[0,'Snelwerkende insuline (eenheden)']] = df2.loc[df2[1, 'Snelwerkende insuline (eenheden)']]
df2.drop([0, 1, 2])
Output: (screenshot omitted)
When I have to do this for the entire data.csv (whenever a time stamp like "20-01-2023 13:24" is found multiple times), I am worried it will be really slow and time-consuming.
Sample data, matching your data:
import pandas as pd

df = pd.DataFrame(data={
"times":["date1","date1","date1","date1","date1"],
"type":[1,2,3,4,5],
"key1":[1,None,None,None,None],
"key2":[None,"2",None,None,None],
"key3":[None,None,3,None,None],
"key4":[None,None,None,"val",None],
"key5":[None,None,None,None,5],
})
Solution:
melt = df.melt(id_vars="times",
value_vars=df.columns[1:],)
melt = melt.dropna()
pivot = melt.pivot_table(values="value", index="times", columns="variable", aggfunc=lambda x: x)
Change the location of the "type" column:
index = list(pivot.columns).index("type")
pivot = pd.concat([pivot.iloc[:,index:], pivot.iloc[:,:index]], axis=1)
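As a small usage note (my addition, assuming the pivot above produced one row per timestamp), the result can be flattened back into an ordinary frame:
result = pivot.reset_index()
print(result)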
I have two CSV mirror files generated by two different servers. Both files have the same number of lines and should have the exact same unix timestamp column. However, due to some clock issues, some records in one file might have a small difference of a nanosecond from their counterpart records in the other CSV file; see below an example, where the difference is always 1:
dataframe_A dataframe_B
| | ts_ns | | | ts_ns |
| -------- | ------------------ | | -------- | ------------------ |
| 1 | 1661773636777407794| | 1 | 1661773636777407793|
| 2 | 1661773636786474677| | 2 | 1661773636786474677|
| 3 | 1661773636787956823| | 3 | 1661773636787956823|
| 4 | 1661773636794333099| | 4 | 1661773636794333100|
Since these are huge files with millions of lines, I use pandas and dask to process them, but before I process, I need to ensure they have the same timestamp column.
I need to check the difference between column ts_ns in A and B, and if there is a difference of 1 or -1, replace the value in B with the corresponding ts_ns value from A, so that I finally have the same ts_ns value in both files for corresponding records.
How can I do this in a decent way using pandas/dask?
If you're sure that the timestamps should be identical, why don't you simply use the timestamp column from dataframe A and overwrite the timestamp column in dataframe B with it?
Why even check whether the difference is there or not?
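In code, that would simply be something like this (my sketch, assuming df_a and df_b are the two frames from the question, equally long and row-aligned):
df_b['ts_ns'] = df_a['ts_ns'].values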
You can use the pandas merge_asof function for this; see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge_asof.html. The tolerance parameter accepts an int or timedelta, which should be set to 1 for your example, with direction set to 'nearest'.
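A rough sketch of that idea (not tested against the real files; the column name and toy values are taken from the question):
import pandas as pd

df_a = pd.DataFrame({'ts_ns': [1661773636777407794, 1661773636786474677,
                               1661773636787956823, 1661773636794333099]})
df_b = pd.DataFrame({'ts_ns': [1661773636777407793, 1661773636786474677,
                               1661773636787956823, 1661773636794333100]})

# merge_asof needs both sides sorted on the key; keep A's value in a payload
# column so it survives the merge.
a = df_a.assign(ts_ns_a=df_a['ts_ns']).sort_values('ts_ns')
b = df_b.sort_values('ts_ns')

merged = pd.merge_asof(b, a, on='ts_ns', tolerance=1, direction='nearest')

# Where a counterpart within the tolerance was found, take A's value,
# otherwise keep B's own value.
df_b['ts_ns'] = merged['ts_ns_a'].where(merged['ts_ns_a'].notna(), merged['ts_ns'])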
Assuming your files are identical except for your ts_ns column, you can perform a .merge on indices.
import numpy as np
import pandas as pd

df_a = pd.DataFrame({'ts_ns': [1661773636777407794, 1661773636786474677, 1661773636787956823, 1661773636794333099]})
df_b = pd.DataFrame({'ts_ns': [1661773636777407793, 1661773636786474677, 1661773636787956823, 1661773636794333100]})
df_b = (df_b
.merge(df_a, how='left', left_index=True, right_index=True, suffixes=('', '_a'))
.assign(
ts_ns = lambda df_: np.where(abs(df_.ts_ns - df_.ts_ns_a) <= 1, df_.ts_ns_a, df_.ts_ns)
)
.loc[:, ['ts_ns']]
)
But I agree with @ManEngel: just overwrite all the values if you know they are identical.
I'm using Python.
I have extracted text from pdf. So I have a data frame full of strings with just one column and no column name.
I need to filter rows from a starting row until the end. This starting row is identified because it starts with certain characters. Consider the following example:
+----------------+
| aaaaaaa |
| bbbbbb |
| ccccccc |
| hellodddd |
| eeeeeeeee |
| fffffffffff |
| gggggggg |
| hhhhhhhh |
+----------------+
I need to filter rows from the starting row, which is hellodddd, until the end. As you can see, the starting row is identified because it starts with the characters hello.
So, the expected output is:
+----------------+
| hellodddd |
| eeeeeeeee |
| fffffffffff |
| gggggggg |
| hhhhhhhh |
+----------------+
I think this example can be reproduced with the following code:
import pandas as pd

mylist = ['aaaaaaa', 'bbbbbb', 'ccccccc', 'hellodddd', 'eeeeeeeee', 'fffffffffff', 'gggggggg', 'hhhhhhhh']
df = pd.DataFrame(mylist)
I think I need to use the startswith() function first to identify the starting row. But then, what could I do to select the wanted rows (the ones that follow the starting row until the end)?
.startswith() is a method on a string, returning whether or not the string starts with some substring; it won't help you select rows in a dataframe (unless you're looking for the first row with a value that starts with that string).
You're looking for something like:
import pandas as pd
mylist = ['aaaaaaa', 'bbbbbb', 'ccccccc', 'hellodddd', 'eeeeeeeee', 'fffffffffff', 'hellodddd', 'hhhhhhhh']
df = pd.DataFrame(mylist)
print(df[(df[0].values == 'hellodddd').argmax():])
Result:
0
3 hellodddd
4 eeeeeeeee
5 fffffffffff
6 hellodddd
7 hhhhhhhh
Note that I replaced a later value with 'hellodddd' as well, to show that it will include all rows from the first match onwards.
Edit: in response to the comment:
import pandas as pd
mylist = ['aaaaaaa', 'bbbbbb', 'ccccccc', 'hellodddd', 'eeeeeeeee', 'fffffffffff', 'hellodddd', 'hhhhhhhh']
df = pd.DataFrame(mylist)
print(df[(df[0].str.startswith('hello')).argmax():])
Result is identical: .argmax() on the boolean mask returns the position of the first True value, i.e. the first row whose string starts with 'hello', and slicing from that position keeps every row from there to the end.
I don't know much about pandas, but I know that itertools can solve this problem:
import itertools
mylist = [
'aaaaaaa', 'bbbbbb', 'ccccccc', 'hellodddd', 'eeeeeeeee',
'fffffffffff', 'gggggggg', 'hhhhhhhh'
]
result = list(itertools.dropwhile(
lambda element: not element.startswith("hello"),
mylist,
))
The dropwhile function drops (discards) elements as long as the condition holds; after that, it returns the rest.
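If you then want the remaining rows back in a dataframe (a small assumption about the desired end state), the same idea can be applied directly to the dataframe's column:
import itertools
import pandas as pd

# df is the single-column dataframe from the question; column 0 holds the strings.
df_filtered = pd.DataFrame(list(itertools.dropwhile(
    lambda element: not element.startswith("hello"),
    df[0],
)))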
I want to append columns from tables generated in a loop to a dataframe. I was hoping to accomplish this using pandas.merge, but it doesn't seem to be working out for me.
My code:
from datetime import date
from datetime import timedelta
import pandas
import numpy
import pyodbc
date1 = date(2017, 1, 1) #Starting Date
date2 = date(2017, 1, 10) #Ending Date
DateDelta = date2 - date1
DateAdd = DateDelta.days
StartDate = date1
count = 1
# Create the holding table
conn = pyodbc.connect('Server Information')
basetable = pandas.read_sql("SELECT....", conn)

while count <= DateAdd:
    print(StartDate)
    datatable = pandas.read_sql("SELECT...WHERE Date = " + str(StartDate) + "...", conn)
    finaltable = basetable.merge(datatable, how='left', left_on='OrganizationName', right_on='OrganizationName')
    StartDate = StartDate + timedelta(days=1)
    count = count + 1

print(finaltable)
Shortened the select statements for brevity's sake, but the tables produced look like this:
**Basetable**
School_District
---------------
District_Alpha
District_Beta
...
District_Zed
**Datatable**
School_District|2016-01-01|
---------------|----------|
District_Alpha | 400 |
District_Beta | 300 |
... | 200 |
District_Zed | 100 |
I have the datatable written so that the column takes the name of the date selected for that particular loop, so column names can be unique once I get this up and running. My problem, however, is that the above code only produces one column of data. I have a good guess as to why: only the last merge is being processed. I thought using pandas.append would be the way to get around that, but pandas.append doesn't "join" like merge does. Is there some other way to accomplish a sort of join-and-append using Pandas? My goal is to keep this flexible so that other dates can be easily input depending on our data needs.
In the end, what I want to see is:
School_District|2016-01-01|2016-01-02|... |2016-01-10|
---------------|----------|----------|-----|----------|
District_Alpha | 400 | 1 | | 45 |
District_Beta | 300 | 2 | | 33 |
... | 200 | 3 | | 5435 |
District_Zed | 100 | 4 | | 333 |
Your error is in the statement finaltable = basetable.merge(datatable,...). At each loop iteration, you merge the original basetable with the new datatable, store the result in the finaltable... and discard it. What you need is basetable = basetable.merge(datatable,...). No finaltables.
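A sketch of the corrected loop (SQL and connection details elided exactly as in the question; on='OrganizationName' is equivalent to the left_on/right_on pair used there):
while count <= DateAdd:
    datatable = pandas.read_sql("SELECT...WHERE Date = " + str(StartDate) + "...", conn)
    basetable = basetable.merge(datatable, how='left', on='OrganizationName')
    StartDate = StartDate + timedelta(days=1)
    count = count + 1

print(basetable)  # one new date column per iteration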
I've got a basic dictionary that gives me a count of how many times data shows up. e.g. Adam: 10, Beth: 3, ... , Zack: 1
If I do df = pd.DataFrame([dataDict]).T then the keys from the dictionary become the index of the dataframe and I only have one true column of data. I've looked, but I haven't found a way around this, so any help would be appreciated.
Edit: More detail
The dictionary was formed from a count function of another dataframe, e.g. dataDict = df1.Name.value_counts().to_dict()
This is my expected output.
|   | Name | Count |
|---|------|-------|
| 0 | Adam | 10    |
| 1 | Beth | 3     |
What I'm getting at the moment is this:
|      | Count |
|------|-------|
| Adam | 10    |
| Beth | 3     |
Try reset_index:
dataDict = dict(Adam=10, Beth=3, Zack=1)
df = pd.Series(dataDict).rename_axis('Name').reset_index(name='Count')
df
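With that dictionary, this should print something like:

   Name  Count
0  Adam     10
1  Beth      3
2  Zack      1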