Quantum

Working with Pandas in Quantum

Apr 9, 2017 • Thomas Chen

I recently came across a tutorial on working with the modern Pandas API. If you’re like me, you work in Pandas with multiple browser windows open, each holding a search for Pandas syntax, because you can never remember how to use its functions. That makes Pandas a great candidate for implementation in a graph processing network, since a graph processor does not require you to remember syntax. Data wrangling with dataframes also tends to take multiple steps, and later steps often reuse the results of earlier ones. So I decided to write a series of posts that parallel the tutorial, and I’ll also use them to demonstrate how easy it is to wrap Pandas functions into PyCells.

Indexing

As mentioned in the tutorial, you can get rows from a dataframe by using the loc and iloc indexers. loc selects rows or columns by label, which is what you want when your index holds labels. iloc selects rows and columns by their integer position. Here’s the code I used to wrap iloc into a cell:

class Iloc(Custom):
    required = ['data']

    def __init__(self):
        self.inputs = {'axis0': None, 'axis1': None, 'data': None}
        self.outputs = {'dataframe': None}

    @data_process
    def iloc(self, h5):
        df = None
        if self.inputs['axis1'] is None:
            df = h5.df.iloc[self.inputs['axis0']]
        else:
            df = h5.df.iloc[self.inputs['axis0'], self.inputs['axis1']]
        return df

    def process(self):
        self.outputs['dataframe'] = self.iloc(self.inputs['data'])
        return super().process()

As you can see, I am subclassing the Custom class type that is provided by PyCell and I am using the data_process decorator on my iloc function. If you want to learn more about these aspects of the code, click through to the documentation.

In the __init__ function, you will see that we define our input and output sockets. These are defined as dictionaries and the keys will become the socket names when Quantum spawns the cell. To get started building your own cell, you can follow the template provided in the cookbook.

You can define whatever functions you need in the cell. In this example, I’ve defined the iloc function which will be used in the process function. The process function is a reserved name and gets called when Quantum executes the cell.
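
Outside of Quantum, the difference between the two indexers that these cells wrap can be seen in plain pandas. The following is a minimal sketch on a made-up frame (the index labels and the values are assumptions for illustration, not rows from the dataset used later):

```python
import pandas as pd

# Toy frame with string labels as the index, so label and position differ.
df = pd.DataFrame(
    {"Open": [278.26, 278.68, 277.89], "High": [278.75, 278.78, 278.18]},
    index=["a", "b", "c"],
)

# loc selects by label; a label slice includes both endpoints.
by_label = df.loc["a":"b", "High"]

# iloc selects by integer position; a position slice excludes the stop.
by_position = df.iloc[0:2, 1]

# Both selections pick the first two rows of the 'High' column.
print(by_label.equals(by_position))
```

The two arguments correspond to the axis0 and axis1 sockets of the cell: the first selects rows, the optional second selects columns.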

The implementation of loc is pretty much exactly the same, except that we replace the h5.df.iloc call with h5.df.loc. Pretty simple. I add both cells to the PyCell registry list (within the dataframe_cell.py file) like so:

registry += [
    {
        'name': 'Iloc',
        'module': 'PyCell.dataframe_cell',
        'categories': ['Data', 'Modify']
    },
    {
        'name': 'Loc',
        'module': 'PyCell.dataframe_cell',
        'categories': ['Data', 'Modify']
    }
]

Once this is done, you will have access to these new cells in Quantum’s contextual menu under Data>Modify>Iloc and Data>Modify>Loc. This is what the cells should look like:

Modify a Dataframe

Let’s take a look at how we can use Quantum to manipulate our data. I’ll be using the following dataset:

import pandas as pd

df = pd.read_csv('tesla-sentiment.csv')
df.tail()

     Date                       sentiment  influence  Open    High    Low     Close   Volume
281  2017-03-30 11:00:00-07:00  0.27       24         278.26  278.75  278.20  278.61  79532.0
282  2017-03-30 11:30:00-07:00  0.22       122        278.68  278.78  277.47  277.87  125126.0
283  2017-03-30 12:00:00-07:00  0.19       21         277.89  278.18  277.45  277.89  102393.0
284  2017-03-30 12:30:00-07:00  0.17       53         277.91  278.35  277.47  278.00  273989.0
285  2017-03-30 13:00:00-07:00  0.21       45         278.00  278.01  277.72  277.92  91511.0

This is stock data, where Open, High, Low and Close are the prices for the time intervals found in Date. The meanings of the other fields are unimportant for this example.

I’ll demonstrate how to perform the following manipulation:

df.loc[df['High']>279, 'High'] = df.loc[df['High']>279, 'High']/10

What we are doing is dividing every price in the ‘High’ column that exceeds $279 by 10. Don’t ask me when you would ever need to do this to stock data.
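
To see exactly which rows that statement touches, here is a small sketch on made-up prices (the four values are assumptions chosen for illustration, not rows from tesla-sentiment.csv). It builds the boolean mask once and reuses it on both sides of the assignment:

```python
import pandas as pd

# Illustrative prices only (not rows from tesla-sentiment.csv).
df = pd.DataFrame({"High": [282.0, 279.0, 278.75, 280.5]})

# Build the boolean mask once and reuse it on both sides.
mask = df["High"] > 279
df.loc[mask, "High"] = df.loc[mask, "High"] / 10

print(df["High"].tolist())  # [28.2, 279.0, 278.75, 28.05]
```

Note that a price of exactly 279 is left alone, because the comparison is strictly greater than.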

We can accomplish this same task in Quantum with the Update cell. Here’s the source code for Update:

class Update(Custom):
    required = ['dataframe', 'column', 'values']

    def __init__(self):
        self.return_msg_ = "Ready to update data."
        self.inputs = {'dataframe': None, 'rows': None, 'column': None,
                       'values': None}
        self.outputs = {'dataframe': None}

    @data_process
    def update(self, h5):
        assert isinstance(self.inputs['values'], H5), "Socket must be an H5."
        assert isinstance(self.inputs['rows'].df, pd.Series), \
            "Rows must be a series."
        df = h5.df
        df.loc[self.inputs['rows'].df, self.inputs['column']] = \
            self.inputs['values'].df
        return df

    def process(self):
        self.outputs['dataframe'] = self.update(self.inputs['dataframe'])
        self.return_msg_ = 'Data updated!'
        return super().process()

We use this cell like so:

Here’s what’s going on in this circuit:

1IVI8: Read the csv file and turn it into a dataframe.
4KMIO: Define a common variable to use in later cells.
4KMBK: Grab the data from the ‘High’ column.
4KMWW: Select the rows where ‘High’ is greater than 279.
4KNI8: Calculate the values in ‘High’ divided by 10.
4K03K: Set the values of the filtered rows to the calculated values.

Here’s the result:

     Date                       sentiment  influence  Open    High     Low     Close   Volume
271  2017-03-29 13:00:00-07:00  0.02       114        277.36  277.380  277.12  277.38  56145.0
272  2017-03-30 06:30:00-07:00  0          52         278.04  28.200   277.21  281.45  1075623.0
273  2017-03-30 07:00:00-07:00  0.01       24         281.37  28.147   279.21  279.91  478321.0
274  2017-03-30 07:30:00-07:00  0          15         279.87  28.075   279.55  280.00  274664.0
275  2017-03-30 08:00:00-07:00  0.01       37         280.00  28.160   279.95  280.39  345942.0
276  2017-03-30 08:30:00-07:00  0.11       87         280.37  28.090   279.17  279.30  235248.0
277  2017-03-30 09:00:00-07:00  0.12       89         279.24  27.959   278.53  279.58  160264.0
278  2017-03-30 09:30:00-07:00  0.11       12         279.51  27.951   278.42  278.54  100704.0
279  2017-03-30 10:00:00-07:00  0.16       23         278.69  279.000  278.33  278.94  111686.0
280  2017-03-30 10:30:00-07:00  0.23       19         278.89  27.912   278.15  278.34  117323.0
281  2017-03-30 11:00:00-07:00  0.27       24         278.26  278.750  278.20  278.61  79532.0
282  2017-03-30 11:30:00-07:00  0.22       122        278.68  278.780  277.47  277.87  125126.0
283  2017-03-30 12:00:00-07:00  0.19       21         277.89  278.180  277.45  277.89  102393.0
284  2017-03-30 12:30:00-07:00  0.17       53         277.91  278.350  277.47  278.00  273989.0
285  2017-03-30 13:00:00-07:00  0.21       45         278.00  278.010  277.72  277.92  91511.0
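
For comparison, the six steps of the circuit map roughly onto plain pandas as follows. This is only a sketch of the data flow, not what Quantum runs internally: the inline CSV is a made-up stand-in for tesla-sentiment.csv, and the variable names are my own labels for what each cell passes along.

```python
import io
import pandas as pd

# Inline CSV standing in for tesla-sentiment.csv (illustrative values).
csv_text = """Date,High
2017-03-30 06:30,282.0
2017-03-30 10:00,279.0
2017-03-30 11:00,278.75
"""

df = pd.read_csv(io.StringIO(csv_text))  # 1. read the CSV into a dataframe
column = "High"                          # 2. shared variable reused below
high = df[column]                        # 3. grab the 'High' column
rows = high > 279                        # 4. boolean Series marking rows to change
values = high[rows] / 10                 # 5. replacement values for those rows
df.loc[rows, column] = values            # 6. write them back, as Update does

print(df[column].tolist())  # [28.2, 279.0, 278.75]
```

The rows variable is a boolean Series, which is the kind of input the Update cell’s assertion on its rows socket expects.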

Wrap Up

I’ll wrap up the post here. Hopefully, you saw how easy it is to define new cells for Quantum and how you can use Quantum for some basic data manipulation. In the next post I’ll cover some additional useful data manipulation cells as well as how you can visualize your data in Quantum. Thanks for reading!