Scientific notation in pandas - Python

I have two CSV files. When I read and print them, the output looks like this:
tweet_id               id
312498024964313000     3.430000e+17
312278640362659000     3.430000e+17
I need the tweet_id and id columns to be in the same format, and the required sample output is:
tweet_id        id
3.124980e+17    3.430660e+17
3.122790e+17    3.430880e+17
Please tell me how to solve this problem.
I later use both of these columns to merge the two CSV files.

You can set the display float format with pd.set_option. Just cast both columns to float first:
pd.set_option('display.float_format', '{:.6g}'.format)
df = df.astype(float)
print(df)
      tweet_id        id
0  3.12498e+17  3.43e+17
1  3.12279e+17  3.43e+17
Note: Your expected output for id doesn't seem to match your input. The above result is based on the sample input provided.
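Since you say you will later merge on these ID columns, one caveat: 64-bit floats cannot represent 18-digit tweet IDs exactly, so merging on float values can silently collapse distinct IDs. A safer sketch, assuming the raw CSVs actually contain the full integer IDs (the file names below are placeholders, not from your post), is to read the columns as integers and merge on those:
import pandas as pd

# Hypothetical file names; substitute your actual CSVs.
df1 = pd.read_csv('tweets1.csv', dtype={'tweet_id': 'int64'})
df2 = pd.read_csv('tweets2.csv', dtype={'id': 'int64'})

merged = df1.merge(df2, left_on='tweet_id', right_on='id')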

Hey, I think you have to combine them into one DataFrame. You can tell because your printed output at the bottom starts at a 0 index as well.
Use the .join method to join them, then try printing them:
data_frame1 = pd.DataFrame(data=['a', 'b', 'c'], columns=['Alphabit'])
data_frame2 = pd.DataFrame(data=[1, 2, 3], columns=['Numbers'])
data_frame1.join(data_frame2)
Edit: sorry, I think I misinterpreted your original question.


How to only extract Close prices from Dataframe Pandas

I can't find a method to loop over my DataFrame (df_yf) and extract all the "Close" prices into a new df_adj. The df is grouped by coin price.
Initially, I tried something like the following, but it throws an error:
for i in range(len(df_yf.columns)):
    df_adj.append(df_yf[i]["Close"])
I also tried using .get and .filter, but those throw errors too:
"list indices must be integers or slices, not str; perhaps you missed a comma?"
EDIT!!
Thank you for the answers. It made me realize my mistake :D. I shouldn't group by tickers, so I changed it to group by prices (Low, Close, etc.) and was then able to simply extract the right columns with df_adj = df_yf["Close"], as was mentioned:
df_adj = np.array(df_yf["Close"])
A DataFrame extracts columns like a dict, and then .values gives you the ndarray form:
df_adj = df_yf["Close"].values
If you group by Tickers, you could use:
df_adj = pd.DataFrame()
for i in [ticker[0] for ticker in df_yf]:
    df_adj[i] = df_yf[i]['Close']
Result:
   Ticker1  Ticker2  Ticker3
0        1        1        1
1        3        3        3

Python [Pandas/docx]: Merging two rows based on common name

I am trying to write a script using python-docx and pandas in Python 3 that performs the following actions:
Take input from a CSV file
Merge common values of column C and add each value into the docx
Export the docx
My raw CSV is as below:
SN. Name Instance Severity
1 Name1 file1:line1 CRITICAL
2 Name1 file2:line3 CRITICAL
3 Name2 file1:line1 Low
4 Name2 file1:line3 Low
and so on...
and I want my docx output as shown in this screenshot:
https://i.stack.imgur.com/1xNc0.png
I am not able to figure out how to filter the "Instance" values based on "Name" using pandas and later print them into the docx.
Thanks in advance.
The code below will select the relevant columns, group by 'Name' and 'Severity', and join the Instance values together:
df2 = df[["Name", "Instance", "Severity"]]
df2["Instance"] = df2.groupby(['Name', 'Severity'])['Instance'].transform(lambda x: '\n'.join(x))
Finally, remove the duplicates and transpose to get the desired output:
df2 = df2.drop_duplicates()
df2 = df2.T
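The question also asks for a docx export, which the snippet above doesn't cover. A rough sketch with python-docx, assuming df2 is the de-duplicated frame from before the final transpose (the output file name is a placeholder):
from docx import Document

doc = Document()
table = doc.add_table(rows=1, cols=len(df2.columns))

# Header row taken from the DataFrame's column names.
for col_idx, col_name in enumerate(df2.columns):
    table.rows[0].cells[col_idx].text = str(col_name)

# One table row per record; the newline-joined Instance values stay in a single cell.
for _, row in df2.iterrows():
    cells = table.add_row().cells
    for col_idx, value in enumerate(row):
        cells[col_idx].text = str(value)

doc.save('output.docx')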

how to divide a pandas dataframe into different dataframes based on unique values from one column and iterate over that?

I have a DataFrame with three columns.
The first column has 3 unique values. I used the code below to create a separate DataFrame per unique value; however, I am unable to iterate over those DataFrames and am not sure how to do so.
df = pd.read_excel("input.xlsx")
unique_groups = list(df.iloc[:, 0].unique())  # let's assume the unique values are 0, 1, 2
mtlist = []
for index, value in enumerate(unique_groups):
    globals()['df%s' % index] = df[df.iloc[:, 0] == value]
    mtlist.append('df%s' % index)
print(mtlist)
O/P
['df0', 'df1', 'df2']
For example, let's say I want to find out the length of the first unique DataFrame. If I manually type the name of the DataFrame, I get the correct output:
len(df0)
O/P
35
But I am trying to automate the code, so I want to get the length and iterate over each DataFrame the same way I would by typing its name.
What I'm looking for: if I try the code below,
len('df%s' % 0)
I want to get the actual length of the DataFrame instead of the length of the string.
Could someone please guide me on how to do this?
I have also tried to create a dictionary using the code below, but I can't figure out how to iterate over the dictionary when the DataFrame has more than two columns, where the key would be the unique group and the value contains the remaining columns on the same line.
df = pd.read_excel("input.xlsx")
unique_groups = list(df["Assignment Group"].unique())
length_of_unique_groups = len(unique_groups)
mtlist = []
df_dict = {name: df.loc[df['Assignment Group'] == name] for name in unique_groups}
Can someone please provide a better solution?
UPDATE
SAMPLE DATA
Assignment_group Description Document
Group A Text to be updated on the ticket 1 doc1.pdf
Group B Text to be updated on the ticket 2 doc2.pdf
Group A Text to be updated on the ticket 3 doc3.pdf
Group B Text to be updated on the ticket 4 doc4.pdf
Group A Text to be updated on the ticket 5 doc5.pdf
Group B Text to be updated on the ticket 6 doc6.pdf
Group C Text to be updated on the ticket 7 doc7.pdf
Group C Text to be updated on the ticket 8 doc8.pdf
Let's assume there are 100 rows of data.
I'm trying to automate ServiceNow ticket creation with the above data.
My end goal is that Group A tickets should go to one group; however, a unique task has to be created for each description. We can club 10 tasks together and submit them as one request, so if I divide the DataFrame into separate DataFrames based on Assignment_group, it would be easier to iterate over (that's the only idea I could think of).
For example, let's say we have REQUEST001; within that request there will be multiple sub-tasks such as STASK001, STASK002 ... STASK010.
Hope this helps.
Your problem is easily solved by groupby, one of the most useful tools in pandas:
length_of_unique_groups = df.groupby('Assignment Group').size()
You can do all kinds of operations (sum, count, std, etc.) on the remaining columns, like getting the mean price for each group if price were a column.
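If you do need to work with each per-group DataFrame (for example, to batch rows into requests), you can also iterate over the groupby object directly instead of creating variables with globals(). A minimal sketch, using the 'Assignment Group' column name from your dictionary example:
for group_name, group_df in df.groupby('Assignment Group'):
    # group_df is an ordinary DataFrame containing only this group's rows
    print(group_name, len(group_df))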
I think you want to try something like len(eval('df%s' % 0))

How to sort a csv file by column

I need to sort a .csv file in a very specific way, but I have pretty limited knowledge of Python. I have some code that works, but it doesn't really do exactly what I want it to do. The format is as follows:
{header} {header} {header} {header}
{dataA} {dataB} {dataC} {dataD}
In the CSV, whatever dataA is, it is usually repeated 100-200 times. Is there a way I can take dataA (e.g. examplecompany), find out how many times it repeats, and then how many times each dataC value appears with that dataA as the first item in the row? For example, the output might be: examplecompany appeared 100 times; out of those 100, datac1 appeared 45 times and datac2 appeared 55 times. I'm really terrible at explaining things; any help would be appreciated.
You can use csv.DictReader to read the file and then sort by the key you want:
from csv import DictReader
with open("test.csv") as f:
    reader = DictReader(f)
    sorted_rows = sorted(list(reader), key=lambda x: x["column1"])
CSV file I tested it with (test.csv):
column1,column2
2,bla
1,blubb
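The question also asks for counts rather than just sorting. A small sketch of that counting with the standard library (the file name and the headerA/headerC column names are placeholders; substitute your real headers):
from collections import Counter
from csv import DictReader

with open("yourfile.csv") as f:
    rows = list(DictReader(f))

# How many times each headerA value appears.
a_counts = Counter(row["headerA"] for row in rows)

# How many times each headerC value appears for one particular headerA value.
c_counts = Counter(row["headerC"] for row in rows
                   if row["headerA"] == "examplecompany")

print(a_counts)
print(c_counts)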
It is not clear what you want to accomplish, since you have not provided any code or a complete example of input/output for your problem.
To me, it seems that you want to count occurrences of the data in headerC for each unique value in headerA.
Suppose you have the following .csv file:
headerA,headerB,headerC,headerD
examplecompany1,datab,datac1,datad
examplecompany2,datab,datac2,datad
examplecompany2,datab,datac1,datad
examplecompany1,datab,datac2,datad
examplecompany1,datab,datac1,datad
examplecompany2,datab,datac2,datad
examplecompany1,datab,datac1,datad
examplecompany1,datab,datac2,datad
examplecompany1,datab,datac3,datad
You can accomplish this counting with pandas. Following is an example of how you might do it.
>>> import pandas as pd
>>> df = pd.read_csv('test.csv')
>>> df.groupby(['headerA'])['headerC'].value_counts()
headerA          headerC
examplecompany1  datac1     3
                 datac2     2
                 datac3     1
examplecompany2  datac2     2
                 datac1     1
Name: headerC, dtype: int64
Here, groupby will group the DataFrame using headerA as a reference. You can group by a single Series or a list of Series. After that, the square-bracket notation is used to access the headerC column, and value_counts will count each occurrence of headerC within each headerA group. Afterwards you can just format the output however you want.
Edit:
I forgot that you also wanted the number of occurrences of headerA, but that is really simple, since you can get it directly by selecting the headerA column of the DataFrame df and calling value_counts on it.
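For completeness, that looks roughly like this with the sample file above (6 rows for examplecompany1, 3 for examplecompany2):
>>> df['headerA'].value_counts()
examplecompany1    6
examplecompany2    3
Name: headerA, dtype: int64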

How to use pandas.DataFrame.assign() to add new column based on a different dataframe

I have two data frames.
df1:
filename|data
fileA|1
fileB|33
fileC|343
df2:
path|filesize|filetype
/tmp/fileA.csv|123|csv
/tmp/fileB.csv|123|csv
/tmp/fileC.csv|3534|csv
/tmp/fileD.csv|234|csv
I want the result to be
filename|data|path
fileA|1|/tmp/fileA.csv
fileB|33|/tmp/fileB.csv
fileC|343|/tmp/fileC.csv
fileD|3243|/tmp/fileD.csv
This seems extremely simple, but I can't seem to get it to work with .assign(). I need to match each row in df1.filename with the corresponding df2.path and then add that path as a new column on df1.
I tried the following, but it complains that a Series is not hashable:
df1.assign(path = lambda x: df2[df2.path.str.contains(x.filename + ".csv")][path])
{TypeError}'Series' objects are mutable, thus they cannot be hashed
I tested that my df1.assign() usage was correct by doing
df1.assign(path = lambda x: x.filename)
and it worked: it just appended the filename onto df1 (which is what I would expect).
I'm assuming the problem area is contains(x.filename + ".csv") being given a Series. If I change it to x.filename.values I then get
{TypeError}unhashable type: 'numpy.ndarray'. I don't understand what "x" is. I assume it's a Series object, but I have no idea how to tell which "row" it's associated with if so.
I could brute-force this and just loop over df1, but df1 has 2M+ records, and loops are generally frowned upon for performance reasons with pandas. Can someone point out what I'm doing wrong?
IIUC, I think you want to use the str accessor and extract with a regex to pull the filename out of path, then merge on filename:
df2.assign(filename=df2.path.str.extract(r'(\w+)\.csv', expand=True))\
.merge(df1, on='filename')
Output:
             path  filesize filetype filename  data
0  /tmp/fileA.csv       123      csv    fileA     1
1  /tmp/fileB.csv       123      csv    fileB    33
2  /tmp/fileC.csv      3534      csv    fileC   343
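One caveat: the desired output in the question also lists fileD, which doesn't exist in df1, and the inner merge above drops it. If unmatched rows from df2 should be kept, a left merge is a reasonable sketch (the data value for fileD would then be NaN, not the 3243 shown in the question):
df2.assign(filename=df2.path.str.extract(r'(\w+)\.csv', expand=True))\
   .merge(df1, on='filename', how='left')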
