I have raw data containing the number of stores spread over numerous pages with no headers or columns.
Please sample below
I want to tranpose the data to this
Anyone who can help me figure out how to get the results I want?
import pandas as pd
# Creating the DataFrame
df = pd.DataFrame({"A":[12, 4, 5, None, 1],
"B":[7, 2, 54, 3, None],
"C":[20, 16, 11, 3, 8],
"D":[14, 3, None, 2, 6]})
index_ = ['Row_1', 'Row_2', 'Row_3', 'Row_4', 'Row_5']
df.index = index_
# Print the DataFrame
print(df)
> # return the transpose result = df.transpose()
> # Print the result print(result)
Related
I have a piece of code like this:
import pandas as pd
data = {
'col1': [17,2,3,4,5,5,10,22,31,11,65,86],
'col2': [6,7,8,9,10,31,46,12,20,37,91,32],
'col3': [1,2,3,4,5,6,7,8,9,10,11,12]
}
df = pd.DataFrame(data)
sampling_period = 3
abnormal_data = set()
for i in range(sampling_period):
# get index of [0, 3, 6, 9, ...], [1, 4, 7, 10, ...], and [2, 5, 8, 11, ...]
df_sampled = df[i::sampling_period]
diff = df_sampled - df_sampled.shift(1)
# diff >= 5 are considered as an abnormal columns
abnormal_df = df_sampled[
diff >= 5
].dropna(how="all", axis=1)
abnormal_data = abnormal_data.union(set(abnormal_df.columns))
print(f"abnormal_data: {abnormal_data}")
What the code above does are as the followings:
Sampling all the columns in df based on sampling_period.
If the difference between 2 consecutive elements in df_sampled is larger than or equal to 5, mark this column as abnormal.
Return abnormal columns.
Is there anyway to avoid the for loop in the code?
The code above takes a lot of time to run when sampling_period and df becomes large. I wish that it could run faster.
For example, when my sampling_period is 60, and df.shape is (20040, 3562), it takes about 683 seconds to run the above code.
I have a dataframe that has duplicated time indices and I would like to get the mean across all for the previous 2 days (I do not want to drop any observations; they are all information that I need). I've checked pandas documentation and read previous posts on Stackoverflow (such as Apply rolling mean function on data frames with duplicated indices in pandas), but could not find a solution. Here's an example of how my data frame look like and the output I'm looking for. Thank you in advance.
data:
import pandas as pd
df = pd.DataFrame({'id': [1,1,1,2,3,3,4,4,4],'t': [1, 2, 3, 2, 1, 2, 2, 3, 4],'v1':[1, 2, 3, 4, 5, 6, 7, 8, 9]})
output:
t
v2
1
-
2
-
3
4.167
4
5
5
6.667
A rough proposal to concatenate 2 copies of the input frame in which values in 't' are replaced respectively by values of 't+1' and 't+2'. This way, the meaning of the column 't' becomes "the target day".
Setup:
import pandas as pd
df = pd.DataFrame({'id': [1,1,1,2,3,3,4,4,4],
't': [1, 2, 3, 2, 1, 2, 2, 3, 4],
'v1':[1, 2, 3, 4, 5, 6, 7, 8, 9]})
Implementation:
len = df.shape[0]
incr = pd.DataFrame({'id': [0]*len, 't': [1]*len, 'v1':[0]*len}) # +1 in 't'
df2 = pd.concat([df + incr, df + incr + incr]).groupby('t').mean()
df2 = df2[1:-1] # Drop the days that have no full values for the 2 previous days
df2 = df2.rename(columns={'v1': 'v2'}).drop('id', axis=1)
Output:
v2
t
3 4.166667
4 5.000000
5 6.666667
Thank you for all the help. I ended up using groupby + rolling (2 Day), and then drop duplicates (keep the last observation).
I am not understanding how to essentially say: columns= [0:6, 12:15])
When I try this I get invalid syntax at the :
import pandas as pd
data = pd.read_excel (rf'C:\Users\dusti\Desktop\bulk export.xlsx',
sheet_name=1,
header=None)
df = pd.DataFrame(data,
columns= [0,1,2,3,4,5,6,12,13,14,15])
df.to_csv(rf'C:\Users\dusti\Desktop\bulk export1.csv',
header=False,
index=False)
print (df)
this thing that you trying is slicing. it used for select a subset of a list
You can use the range function for create numbers and convert it to a list with the list function
list(range(0,6+1)) + list(range(12,15+1))
#output :
[0, 1, 2, 3, 4, 5, 6, 12, 13, 14, 15]
I'm trying to save a pandas DataFrame as an excel file and import it again and convert it back to a dictionary. The data frame is quite large in size. For instance, consider the following code:
import pandas as pd
path = 'file.xlsx'
dict1 = {'a' : [3, [1, 2, 3], 'text1'],
'b' : [4, [4, 5, 6, 7], 'text2']}
print('\n\nType 1:', type(dict1['a'][1]))
df1 = pd.DataFrame(dict1)
df1.to_excel(path, sheet_name='Sheet1')
print("\n\nSaved df:\n", df1 , '\n\n')
df2 = pd.read_excel(path, sheet_name='Sheet1')
print("\n\nLoaded df:\n", df2 , '\n\n')
dict2 = df2.to_dict(orient='list')
print("New dict:", dict2, '\n\n')
print('Type 2:', type(dict2['a'][1]))
The output is:
Type 1: <class 'list'>
Saved df:
a b
0 3 4
1 [1, 2, 3] [4, 5, 6, 7]
2 text1 text2
Loaded df:
a b
0 3 4
1 [1, 2, 3] [4, 5, 6, 7]
2 text1 text2
New dict: {'a': [3, '[1, 2, 3]', 'text1'], 'b': [4, '[4, 5, 6, 7]', 'text2']}
Type 2: <class 'str'>
Could you help me get back the original dictionary with the same element types?
Thank you!
Now, there is an option with read_excel which allows us to change the dtype of the columns as they're read in, however there is no such option to change the dtype of any of the rows. So, we have to do the type conversion ourselves, after the data has been read in.
As you've shown in your question, df['a'][1] has type str, but you'd like it to have type list.
So, let's say we have some string l ='[1, 2, 3]' we could convert it to a list of ints (l=[1, 2, 3]) as [int(val) for val in l.strip('[]').split(',')]. Now, we can use this in conjunction with the .apply method to get what we desire:
df.iloc[1] = df.iloc[1].apply(lambda x : [int(val) for val in x.strip('[]').split(',')])
Putting this example back together we have:
import pandas as pd
# Data as read in by read_excel method
df2 = pd.DataFrame({'a' : [3, '[1, 2, 3]', 'text1'],
'b' : [4, '[4, 5, 6, 7]', 'text2']})
print('Type: ', type(df2['a'][1]))
#Type: <class 'str'>
# Convert strings in row 1 to lists
df2.iloc[1] = df2.iloc[1].apply(lambda x : [int(val) for val in x.strip('[]').split(',')])
print('Type: ', type(df2['a'][1]))
#Type: <class 'list'>
dict2 = df2.to_dict(orient='list')
I'm trying to drop all columns from a df that start with any of a list of strings. I needed to copy these columns to their own dfs, and now want to drop them from a copy of the main df to make it easier to analyze.
df.columns = ["AAA1234", "AAA5678", "BBB1234", "BBB5678", "CCC123", "DDD123"...]
Entered some code that gave me this dataframes with these columns:
aaa.columns = ["AAA1234", "AAA5678"]
bbb.columns = ["BBB1234", "BBB5678"]
I did get the final df that I wanted, but my code felt rather clunky:
droplist_cols = [aaa, bbb]
droplist = []
for x in droplist_cols:
for col in x.columns:
droplist.append(col)
df1 = df.drop(labels=droplist, axis=1)
Columns of final df:
df1.columns = ["CCC123", "DDD123"...]
Is there a better way to do this?
--Edit for sample data--
df = pd.DataFrame([[1, 2, 3, 4, 5], [1, 3, 4, 2, 1], [4, 6, 9, 8, 3], [1, 3, 4, 2, 1], [3, 2, 5, 7, 1]], columns=["AAA1234", "AAA5678", "BBB1234", "BBB5678", "CCC123"])
Desired result:
CCC123
0 5
1 1
2 3
3 1
4 1
IICU
Lets begin with a dataframe thus;
df=pd.DataFrame({"A":[0]})
Modify dataframe to include your columns
df2=df.reindex(columns=["AAA1234", "AAA5678", "BBB1234", "BBB5678", "CCC123", "DDD123"], fill_value=0)
Drop all columns starting with A
df3=df2.loc[:,~df2.columns.str.startswith('A')]
If you need to drop say A OR B I would
df3=df2.loc[:,~(df2.columns.str.startswith('A')|df2.columns.str.startswith('B'))]