Creating new dataframe with .txt file using Pandas - python

I have a text file with data displayed like this:
{"created_at":"Mon Jun 02 00:04:00 +0000 2018","id":870430762953920,"id_str":"87043076220","text":"Hello there","source":"\u003ca href=\"http:\/\/tapbots.com\/software\/tweetbot\/mac\" rel=\"nofollow\"\u003eTweetbot for Mac\u003c\/a\u003e","truncated":false,"in_reply_to_status_id"}
The data is Twitter posts and I have hundreds of these in one text file. I want to get the key/value pair "text":"Hello there" and turn that into its own dataframe with an additional column named target. I don't need any of the other columns. I'm doing some sensitivity analysis.
What would be the most pythonic way to go about this? I thought about using
df = pd.read_csv('test.txt', sep=r'"'), but then I don't know how to get rid of all the other columns I don't need and select the column with the text in it.
Any help would be much appreciated!

I had to modify the last two key/value pairs in your data to get it to work. You may want to check that you're receiving the data correctly, or that you copied and pasted it properly, because you should be getting errors with the data exactly as displayed in your post:
"truncated":False,"in_reply_to_status_id":1
Then this worked well for me:
import pandas as pd
with open('test.txt', 'r') as inf1:  # read the text file as code to evaluate
    d = eval(inf1.read())
index = range(len(d))
df = pd.DataFrame(d, index=index)  # an index has to be added because all the values are scalars
df = df.pop('text')
print(df)
Returns
0 Hello there
1 Hello there
2 Hello there
3 Hello there
4 Hello there
5 Hello there
6 Hello there
Name: text, dtype: object
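Since each post is essentially a JSON object, a safer alternative to eval is to parse the file with the json module. This is only a sketch: it assumes a file named tweets.txt with one complete, valid JSON object per line (the sample in the question is truncated), and it leaves the target column empty for you to fill in:
import json
import pandas as pd

texts = []
with open('tweets.txt', 'r') as f:  # assumed file name, one JSON object per line
    for line in f:
        line = line.strip()
        if not line:
            continue
        post = json.loads(line)     # parse one tweet
        texts.append(post['text'])  # keep only the "text" field

df = pd.DataFrame({'text': texts})
df['target'] = None  # placeholder column to fill in later
print(df.head())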


Python [Pandas/docx]: Merging two rows based on common name

I am trying to write a script using python-docx and pandas in Python 3 which performs the following actions:
Take input from a csv file
Merge the common values of Column C and add each value into the docx
Export the docx
My raw csv is as below:
SN. Name Instance Severity
1 Name1 file1:line1 CRITICAL
2 Name1 file2:line3 CRITICAL
3 Name2 file1:line1 Low
4 Name2 file1:line3 Low
and so on...
and I want my docx output to look like this: https://i.stack.imgur.com/1xNc0.png
I am not able to figure out how I can filter the "Instance" values based on "Name" using pandas and later print them into the docx.
Thanks in advance.
The code below will select the relevant columns, group by 'Name' and 'Severity', and join the Instances together:
df2 = df[["Name","Instance","Severity"]]
df2["Instance"] = df2.groupby(['Name','Severity'])['Instance'].transform(lambda x: '\n'.join(x))
Finally, remove the duplicates and transpose to get the desired output:
df2 = df2.drop_duplicates()
df2 = df2.T
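To get the merged rows into a Word document, a minimal sketch using python-docx could look like the following. It assumes the deduplicated frame df2 from above (before the transpose), an input file named input.csv, and an output file named out.docx; all three names are placeholders for your own:
import pandas as pd
from docx import Document

df = pd.read_csv('input.csv')  # assumed input file name
df2 = df[["Name", "Instance", "Severity"]].copy()
df2["Instance"] = df2.groupby(['Name', 'Severity'])['Instance'].transform(lambda x: '\n'.join(x))
df2 = df2.drop_duplicates()

doc = Document()
table = doc.add_table(rows=1, cols=3)
header = table.rows[0].cells
header[0].text, header[1].text, header[2].text = 'Name', 'Instance', 'Severity'
for _, row in df2.iterrows():  # one table row per merged Name/Severity group
    cells = table.add_row().cells
    cells[0].text = str(row['Name'])
    cells[1].text = str(row['Instance'])
    cells[2].text = str(row['Severity'])
doc.save('out.docx')  # example output file name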

How to read SPSS aka (.sav) in Python

It's my first time using Jupyter Notebook to analyze survey data (a .sav file), and I would like to read it in a way that shows the metadata so I can connect the answers with the questions. I'm totally a newbie in this field, so any help is appreciated!
import pandas as pd
import pyreadstat
df, meta = pyreadstat.read_sav('./SimData/survey_1.sav')
type(df)
type(meta)
df.head()
Please let me know if there is an additional step needed for me to be able to see the metadata!
The meta object contains the metadata you are looking for. Probably the most useful attributes to look at are:
meta.column_names_to_labels : a dictionary mapping the column names as you have them in your pandas dataframe to labels, i.e. longer explanations of the meaning of each column
print(meta.column_names_to_labels)
meta.variable_value_labels : a dict where keys are column names and values are themselves dicts mapping the values you find in your dataframe to their value labels.
print(meta.variable_value_labels)
For instance, if you have a column "gender" with values 1 and 2, you could get:
{"gender": {1:"male", 2:"female"}}
which means value 1 is male and 2 female.
You can get those labels applied from the beginning if you pass the argument apply_value_formats:
df, meta = pyreadstat.read_sav('survey.sav', apply_value_formats=True)
You can also apply those value formats to your dataframe anytime with pyreadstat.set_value_labels which returns a copy of your dataframe with labels:
df_copy = pyreadstat.set_value_labels(df, meta)
meta.missing_ranges : gives you the values (or ranges of values) defined as user missing values. Let's say that for a certain variable the survey encoded 1 meaning yes, 2 meaning no, and then missing values: 5 meaning "didn't answer" and 6 meaning "person not at home". When you read the dataframe, by default you will get the values 1, 2 and NaN (missing) instead of 5 and 6. You can pass the argument user_missing=True to get 5 and 6 back, and meta.missing_ranges will tell you that 5 and 6 are missing values. variable_value_labels will give you the "didn't answer" and "person not at home" labels.
df, meta = pyreadstat.read_sav("survey.sav", user_missing=True)
print(meta.missing_ranges)
print(meta.variable_value_labels)
These are the pieces of information potentially useful for your case; not all of them will necessarily be present in your dataset.
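As a concrete illustration, here is a minimal sketch of one way to use that metadata, reading the file from the question with labels applied and renaming the columns to the full question texts (the rename is just one option, not required):
import pyreadstat

# read value labels directly and keep user-defined missing codes
df, meta = pyreadstat.read_sav('./SimData/survey_1.sav',
                               apply_value_formats=True,
                               user_missing=True)

print(meta.column_names_to_labels)  # column name -> question text
print(meta.variable_value_labels)   # column name -> {value: label}

# optionally replace the short column names with the full question labels
df_labeled = df.rename(columns=meta.column_names_to_labels)
print(df_labeled.head())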
More information here: https://ofajardo.github.io/pyreadstat_documentation/_build/html/index.html

How do I extract variables that repeat from an Excel Column using Python?

I'm a beginner at Python and I have a school project where I need to analyze an Excel document. It has approximately 7 columns and more than 1000 rows.
There's a column named "Materials" that starts at B13. It contains a code that we use to identify some materials. A material code looks like this -> 3A8356. There are different material codes in the same column and they repeat a lot. I want to identify them and make a list with only one of each code, no repeats. Is there a way I can analyze the column and extract the codes that repeat, so I can take them and make a new column with only one of each material code?
An example would be:
12 Materials
13 3A8356
14 3A8376
15 3A8356
16 3A8356
17 3A8346
18 3A8346
and transform it to something like this:
1 Materials
2 3A8346
3 3A8356
4 3A8376
Yes.
If df is your dataframe, you only have to do df = df.drop_duplicates(subset=['Materials']) (the default keep='first' keeps one row per code).
To load the dataframe from an excel file, just do:
import pandas as pd
df = pd.read_excel(path_to_file)
the subset argument indicates which column headings you want to look at.
Docs: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
Per the docs, the new dataframe with the duplicates dropped is returned, so you can assign it to any variable you want. If you want to re-index the result, take a look at:
new_data_frame = new_data_frame.reset_index(drop=True)
Or simply
new_data_frame.reset_index(drop=True, inplace=True)
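Putting it together, a minimal sketch of the whole flow; the file name materials.xlsx is a placeholder, and you may need the header or skiprows arguments of read_excel so that the "Materials" header row (row 12 in your sheet) is picked up correctly:
import pandas as pd

df = pd.read_excel('materials.xlsx')  # placeholder file name; adjust header/skiprows as needed

# keep the first occurrence of each material code
unique_materials = df.drop_duplicates(subset=['Materials'])

# optional: sort and re-index so the result looks like the example output
unique_materials = unique_materials.sort_values('Materials').reset_index(drop=True)
print(unique_materials['Materials'])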

Force Pandas to keep multiple columns with the same name

I'm building a program that collects data and adds it to an ongoing excel sheet weekly (read_excel() and concat() with the new data). The issue I'm having is that I need the columns to have the same name for presentation (it doesn't look great with x.1, x.2, ...).
I only need this on the final output. Is there any way to accomplish this? Would it be too time consuming to modify pandas?
You can create a list of custom headers that will be written to Excel:
newColNames = ['x','x','x'.....]
df.to_excel(path,header=newColNames)
You can add spaces to the end of the column names. They will appear the same in Excel, but pandas can tell them apart.
import pandas as pd
df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], columns=['x','x ','x '])
df
x x x
0 1 2 3
1 4 5 6
2 7 8 9
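Another option, assuming the duplicate names only matter in the final output file and your real column names don't contain dots: let pandas mangle the names to x.1, x.2, ... while you work, then strip the suffixes just before writing (output.xlsx is only an example name):
import pandas as pd

# mangled names as produced by read_excel()/concat()
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['x', 'x.1', 'x.2'])

# drop the ".1", ".2", ... suffixes so every column is written as "x"
df.columns = [c.split('.')[0] for c in df.columns]
df.to_excel('output.xlsx', index=False)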

pandas dataframe: duplicates based on column and time range

I have a (very simplified here) pandas dataframe which looks like this:
df
datetime user type msg
0 2012-11-11 15:41:08 u1 txt hello world
1 2012-11-11 15:41:11 u2 txt hello world
2 2012-11-21 17:00:08 u3 txt hello world
3 2012-11-22 18:08:35 u4 txt hello you
4 2012-11-22 18:08:37 u5 txt hello you
What I would like to do now is to get all the duplicate messages which have their timestamp within 3 seconds. The desired output would be:
datetime user type msg
0 2012-11-11 15:41:08 u1 txt hello world
1 2012-11-11 15:41:11 u2 txt hello world
3 2012-11-22 18:08:35 u4 txt hello you
4 2012-11-22 18:08:37 u5 txt hello you
without the third row, as its text is the same as in rows one and two, but its timestamp is not within the range of 3 seconds.
I tried to define the columns datetime and msg as parameters for the duplicated() method, but it returns an empty dataframe because the timestamps are not identical:
mask = df.duplicated(subset=['datetime', 'msg'], keep=False)
print(df[mask])
Empty DataFrame
Columns: [datetime, user, type, msg, MD5]
Index: []
Is there a way where I can define a range for my "datetime" parameter? To illustrate, something
like:
mask = df.duplicated(subset=['datetime_between_3_seconds', 'msg'], keep=False)
Any help here would as always be very much appreciated.
This piece of code gives the expected output:
df[(df.groupby(["msg"], as_index=False)["datetime"].diff().fillna(pd.Timedelta(0)).dt.seconds <= 3).reset_index(drop=True)]
I grouped on the "msg" column of the dataframe, then selected the "datetime" column of each group and used the built-in diff function, which computes the difference between consecutive values in that column. I filled the NaT values with a zero timedelta and selected only those rows whose difference is at most 3 seconds.
Before using the above code, make sure that your dataframe is sorted on datetime in ascending order.
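For reference, a minimal runnable sketch of this approach on the example data from the question:
import pandas as pd

df = pd.DataFrame({
    'datetime': pd.to_datetime(['2012-11-11 15:41:08', '2012-11-11 15:41:11',
                                '2012-11-21 17:00:08', '2012-11-22 18:08:35',
                                '2012-11-22 18:08:37']),
    'user': ['u1', 'u2', 'u3', 'u4', 'u5'],
    'type': ['txt'] * 5,
    'msg': ['hello world', 'hello world', 'hello world', 'hello you', 'hello you'],
})

df = df.sort_values('datetime')  # the diff-based filter assumes ascending order
mask = (df.groupby('msg')['datetime'].diff()
          .fillna(pd.Timedelta(0))
          .dt.seconds <= 3)
print(df[mask])  # rows 0, 1, 3 and 4: the duplicates within 3 seconds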
This bit of code works on your example data, although you might have to play around with any extreme cases.
From your question I'm assuming you want to filter messages relative to the first time each one appears in df. It won't work if you have cases where you want to keep a message that appears again after another threshold.
In short, I wrote a function that takes your dataframe and the 'msg' to filter for. It takes the timestamp of the first time the message appears and compares it to all the other times it appears.
It then selects only the instances where the message appears within 3 seconds of that first appearance.
import pandas as pd

# returns the rows containing a given message within three seconds of its first appearance
def get_info_within_3seconds(df, msg):
    df_of_msg = df[df['msg'] == msg].sort_values(by='datetime')
    t1 = df_of_msg['datetime'].reset_index(drop=True)[0]
    datetime_deltas = [(i - t1).total_seconds() for i in df_of_msg['datetime']]
    filter_list = [delta <= 3.0 for delta in datetime_deltas]
    return df_of_msg[filter_list]

msgs = df['msg'].unique()
# apply the function to each unique message and concatenate the results into a new df
new_df = pd.concat([get_info_within_3seconds(df, m) for m in msgs])
