Working with Pandas in Quantum

I came across a tutorial on working with the modern Pandas API here. If you’re like me when working in Pandas, you have multiple web browser windows open containing various searches for Pandas syntax because I can never remember how to use the Pandas functions. Pandas is a great candidate for implementation in a graph processing network since graph processors do not require you to remember syntax. Working with dataframes, you often need multiple steps in your data wrangling and you are often reusing results of prior steps in later steps. As a result, I decided to write a series of posts that parallel this tutorial, however, I’ll also be using the posts to demonstrate how easy it is to wrap Pandas functions into PyCells.

Indexing

As mentioned in the tutorial, you can get rows from a dataframe by using the loc and iloc methods. The loc function allows you to select rows or columns by label when your indexes are labels. The iloc function allows you to select rows and columns by their position. Here’s the code I used for wrapping these into cells:

class Iloc(Custom):
    required = ['data']

    def __init__(self):
        self.inputs = {'axis0': None, 'axis1': None, 'data': None}
        self.outputs = {'dataframe': None}

    @data_process
    def iloc(self, h5):
        df = None
        if self.inputs['axis1'] is None:
            df = h5.df.iloc[self.inputs['axis0']]
        else:
            df = h5.df.iloc[self.inputs['axis0'], self.inputs['axis1']]
        return df

    def process(self):
        self.outputs['dataframe'] = self.iloc(self.inputs['data'])
        return super().process()

As you can see, I am subclassing the Custom class type that is provided by PyCell and I am using the data_process decorator on my iloc function. If you want to learn more about these aspects of the code, click through to the documentation.

In the __init__ function, you will see that we define our input and output sockets. These are defined as dictionaries and the keys will become the socket names when Quantum spawns the cell. To get started building your own cell, you can follow the template provided in the cookbook.

You can define whatever functions you need in the cell. In this example, I’ve defined the iloc function which will be used in the process function. The process function is a reserved name and gets called when Quantum executes the cell.

The implementation of loc is pretty much exactly the same except that we replace the h5.df.iloc call by h5.df.loc. Pretty simple. I add these to the PyCell registry dictionary (within the dataframe_cell.py file) like so:

registry += [
    {
    'name': 'ILoc',
    'module': 'PyCell.dataframe_cell',
    'categories': ['Data', 'Modify']
    },
    {
    'name': 'Loc',
    'module': 'PyCell.dataframe_cell',
    'categories': ['Data', 'Modify']
    }
]

Once this is done, you will have access to these new cells in Quantum’s contextual menu under Data>Modify>Iloc and Data>Modify>Loc. This is what the cells should look like:

Modify a Dataframe

Let’s take a look at how we can use Quantum to manipulate our data. I’ll be using the following dataset:

df = pd.read_csv('tesla-sentiment.csv')
df.tail()
  Date sentiment influence Open High Low Close Volume
281 2017-03-30 11:00:00-07:00 0.27 24 278.26 278.75 278.20 278.61 79532.0
282 2017-03-30 11:30:00-07:00 0.22 122 278.68 278.78 277.47 277.87 125126.0
283 2017-03-30 12:00:00-07:00 0.19 21 277.89 278.18 277.45 277.89 102393.0
284 2017-03-30 12:30:00-07:00 0.17 53 277.91 278.35 277.47 278.00 273989.0
285 2017-03-30 13:00:00-07:00 0.21 45 278.00 278.01 277.72 277.92 91511.0

This represents stock data where the Open, High, Low, Close represents prices for the time intervals found in Date. The meaning of other fields are unimportant for this example.

I’ll demonstrate how to perform the following manipulation:

df.loc[df['High']>279, 'High'] = df.loc[df['High']>279, 'High']/10

What we are doing is dividing any prices in the ‘High’ column by 10 if the price exceeds $279. Don’t ask me when you would ever need to do this to stock data.

We can accomplish this same task in Quantum with the Update cell. Here’s the source code for Update:

class Update(Custom):
    required = ['dataframe', 'column', 'values']

    def __init__(self):
        self.return_msg_ = "Ready to update data."
        self.inputs = {'dataframe': None, 'rows': None, 'column': None,
                       'values': None}
        self.outputs = {'dataframe': None}

    @data_process
    def update(self, h5):
        assert isinstance(self.inputs['values'], H5), "Socket must be an H5."
        assert isinstance(self.inputs['rows'].df, pd.Series), \
            "Rows must be a series."
        df = h5.df
        df.loc[self.inputs['rows'].df, self.inputs['column']] = \
            self.inputs['values'].df
        return df

    def process(self):
        self.outputs['dataframe'] = self.update(self.inputs['dataframe'])
        self.return_msg_ = 'Data updated!'
        return super().process()

We use this cell like so:

Here’s what’s going on in this circuit:

  1. 1IVI8: Read the csv file and turn it into a dataframe.
  2. 4KMIO: Define a common variable to use in later cells.
  3. 4KMBK: Grab a the data from the ‘High’ column.
  4. 4KMWW: Filter the rows that are greater than 279.
  5. 4KNI8: Calculate the values in ‘High’ divided by 10.
  6. 4K03K: Set the values of the filtered rows to the calculated values.

Here’s the result:

  Date sentiment influence Open High Low Close Volume
271 2017-03-29 13:00:00-07:00 0.02 114 277.36 277.380 277.12 277.38 56145.0
272 2017-03-30 06:30:00-07:00 0 52 278.04 28.200 277.21 281.45 1075623.0
273 2017-03-30 07:00:00-07:00 0.01 24 281.37 28.147 279.21 279.91 478321.0
274 2017-03-30 07:30:00-07:00 0 15 279.87 28.075 279.55 280.00 274664.0
275 2017-03-30 08:00:00-07:00 0.01 37 280.00 28.160 279.95 280.39 345942.0
276 2017-03-30 08:30:00-07:00 0.11 87 280.37 28.090 279.17 279.30 235248.0
277 2017-03-30 09:00:00-07:00 0.12 89 279.24 27.959 278.53 279.58 160264.0
278 2017-03-30 09:30:00-07:00 0.11 12 279.51 27.951 278.42 278.54 100704.0
279 2017-03-30 10:00:00-07:00 0.16 23 278.69 279.000 278.33 278.94 111686.0
280 2017-03-30 10:30:00-07:00 0.23 19 278.89 27.912 278.15 278.34 117323.0
281 2017-03-30 11:00:00-07:00 0.27 24 278.26 278.750 278.20 278.61 79532.0
282 2017-03-30 11:30:00-07:00 0.22 122 278.68 278.780 277.47 277.87 125126.0
283 2017-03-30 12:00:00-07:00 0.19 21 277.89 278.180 277.45 277.89 102393.0
284 2017-03-30 12:30:00-07:00 0.17 53 277.91 278.350 277.47 278.00 273989.0
285 2017-03-30 13:00:00-07:00 0.21 45 278.00 278.010 277.72 277.92 91511.0

Wrap Up

I’ll wrap up the post here. Hopefully, you saw how easy it is to define new cells for Quantum and how you can use Quantum for some basic data manipulation. In the next post I’ll cover some additional useful data manipulation cells as well as how you can visualize your data in Quantum. Thanks for reading!