OpenSesame
Rapunzel Code Editor
DataMatrix
Support forum
Python Tutorials
MindProbe
Python videos

Basic use

Ultra-short cheat sheet:

from datamatrix import DataMatrix, io
# Read a DataMatrix from file
dm = io.readtxt('data.csv')
# Create a new DataMatrix
dm = DataMatrix(length=5)
# The first two rows
print(dm[:2])
# Create a new column and initialize it with the Fibonacci series
dm.fibonacci = 0, 1, 1, 2, 3
# You can also specify column names as if they are dict keys
dm['fibonacci'] = 0, 1, 1, 2, 3
# Remove 0 and 3 with a simple selection
dm = (dm.fibonacci > 0) & (dm.fibonacci < 3)
# Get a list of indices that match certain criteria
print(dm[(dm.fibonacci > 0) & (dm.fibonacci < 3)])
# Select 1, 1, and 2 by matching any of the values in a set
dm = dm.fibonacci == {1, 2}
# Select all odd numbers with a lambda expression
dm = dm.fibonacci == (lambda x: x % 2)
# Change all 1s to -1
dm.fibonacci[dm.fibonacci == 1] = -1
# The first two cells from the fibonacci column
print(dm.fibonacci[:2])
# Column mean
print('Mean: %s' % dm.fibonacci.mean)
# Multiply all fibonacci cells by 2
dm.fibonacci_times_two = dm.fibonacci * 2
# Loop through all rows
for row in dm:
    print(row.fibonacci) # get the fibonacci cell from the row
# Loop through all columns
for colname, col in dm.columns:
    for cell in col: # Loop through all cells in the column
        print(cell) # do something with the cell
# Or just see which columns exist
print(dm.column_names)

Important note: Because of a limitation (or feature, if you will) of the Python language, the behavior of and, or, and chained (x < y < z) comparisons cannot be modified. These therefore do not work with DataMatrix objects as you would expect them to:

# INCORRECT: The following does *not* work as expected
dm = dm.fibonacci > 0 and dm.fibonacci < 3
# INCORRECT: The following does *not* work as expected
dm = 0 < dm.fibonacci < 3
# CORRECT: Use the '&' operator
dm = (dm.fibonacci > 0) & (dm.fibonacci < 3)

Slightly longer cheat sheet:

Basic operations

Creating a DataMatrix

Create a new DataMatrix object, and add a column (named col). By default, the column is of the MixedColumn type, which can store numeric and string data.

import sys
from datamatrix import DataMatrix, __version__
dm = DataMatrix(length=2)
dm.col = ':-)'
print(
    'Examples generated with DataMatrix v{} on Python {}\n'.format(
        __version__,
        sys.version
    )
)
print(dm)

Output:

Examples generated with DataMatrix v1.0.0a5 on Python 3.10.5 | packaged by conda-forge | (main, Jun 14 2022, 07:04:59) [GCC 10.3.0]

+---+-----+
| # | col |
+---+-----+
| 0 | :-) |
| 1 | :-) |
+---+-----+

You can change the length of the DataMatrix later on. If you reduce the length, data will be lost. If you increase the length, empty cells will be added.

dm.length = 3

Concatenating two DataMatrix objects

You can concatenate two DataMatrix objects using the << operator. Matching columns will be combined. (Note that row 2 is empty. This is because we have increased the length of dm in the previous step, causing an empty row to be added.)

dm2 = DataMatrix(length=2)
dm2.col = ';-)'
dm2.col2 = 10, 20
dm3 = dm << dm2
print(dm3)

Output:

+---+-----+------+
| # | col | col2 |
+---+-----+------+
| 0 | :-) |      |
| 1 | :-) |      |
| 2 |     |      |
| 3 | ;-) |  10  |
| 4 | ;-) |  20  |
+---+-----+------+

Creating columns

You can change all cells in column to a single value. This creates a new column if it doesn't exist yet.

dm.col = 'Another value'
print(dm)

Output:

+---+---------------+
| # |      col      |
+---+---------------+
| 0 | Another value |
| 1 | Another value |
| 2 | Another value |
+---+---------------+

You can change all cells in a column based on a sequence. This creates a new column if it doesn't exist yet. This sequence must have the same length as the column (3 in this case).

dm.col = 1, 2, 3
print(dm)

Output:

+---+-----+
| # | col |
+---+-----+
| 0 |  1  |
| 1 |  2  |
| 2 |  3  |
+---+-----+

If you do not know the name of a column, for example because it is defined by a variable, you can also refer to columns as though they are items of a dict. However, this is not recommended, because it makes it less clear whether you are referring to column or a row.

dm['col'] = 'X'
print(dm)

Output:

+---+-----+
| # | col |
+---+-----+
| 0 |  X  |
| 1 |  X  |
| 2 |  X  |
+---+-----+

Renaming columns

dm.rename('col', 'col2')
print(dm)

Output:

+---+------+
| # | col2 |
+---+------+
| 0 |  X   |
| 1 |  X   |
| 2 |  X   |
+---+------+

Deleting columns

You can delete a column using the del keyword:

dm.col = 'x'
del dm.col2
print(dm)

Output:

+---+-----+
| # | col |
+---+-----+
| 0 |  x  |
| 1 |  x  |
| 2 |  x  |
+---+-----+

Slicing and assigning to column cells

Assign to one cell

dm.col[1] = ':-)'
print(dm)

Output:

+---+-----+
| # | col |
+---+-----+
| 0 |  x  |
| 1 | :-) |
| 2 |  x  |
+---+-----+

Assign to multiple cells

This changes row 0 and 2. It is not a slice!

dm.col[0,2] = ':P'
print(dm)

Output:

+---+-----+
| # | col |
+---+-----+
| 0 |  :P |
| 1 | :-) |
| 2 |  :P |
+---+-----+

Assign to a slice of cells

dm.col[1:] = ':D'
print(dm)

Output:

+---+-----+
| # | col |
+---+-----+
| 0 |  :P |
| 1 |  :D |
| 2 |  :D |
+---+-----+

Assign to cells that match a selection criterion

dm.col[1:] = ':D'
dm.is_happy = 'no'
dm.is_happy[dm.col == ':D'] = 'yes'
print(dm)

Output:

+---+-----+----------+
| # | col | is_happy |
+---+-----+----------+
| 0 |  :P |    no    |
| 1 |  :D |   yes    |
| 2 |  :D |   yes    |
+---+-----+----------+

Column properties

Basic numeric properties, such as the mean, can be accessed directly. Only numeric values are taken into account.

dm.col = 1, 2, 'not a number'
# Numeric descriptives
print('mean: %s' % dm.col.mean)
print('median: %s' % dm.col.median)
print('standard deviation: %s' % dm.col.std)
print('sum: %s' % dm.col.sum)
print('min: %s' % dm.col.min)
print('max: %s' % dm.col.max)
# Other properties
print('unique values: %s' % dm.col.unique)
print('number of unique values: %s' % dm.col.count)
print('column name: %s' % dm.col.name)

Output:

mean: 1.5
median: 1.5
standard deviation: 0.7071067811865476
sum: 3.0
min: 1.0
max: 2.0
unique values: [1, 2, 'not a number']
number of unique values: 3
column name: col

Iterating over rows, columns, and cells

By iterating directly over a DataMatrix object, you get successive Row objects. From a Row object, you can directly access cells.

dm.col = 'a', 'b', 'c'
for row in dm:
    print(row)
    print(row.col)

Output:

+----------+-------+
|   Name   | Value |
+----------+-------+
|   col    |   a   |
| is_happy |   no  |
+----------+-------+
a
+----------+-------+
|   Name   | Value |
+----------+-------+
|   col    |   b   |
| is_happy |  yes  |
+----------+-------+
b
+----------+-------+
|   Name   | Value |
+----------+-------+
|   col    |   c   |
| is_happy |  yes  |
+----------+-------+
c

By iterating over DataMatrix.columns, you get successive (column_name, column) tuples.

for colname, col in dm.columns:
    print('%s = %s' % (colname, col))

Output:

col = col['a', 'b', 'c']
is_happy = col['no', 'yes', 'yes']

By iterating over a column, you get successive cells:

for cell in dm.col:
    print(cell)

Output:

a
b
c

By iterating over a Row object, you get (column_name, cell) tuples:

row = dm[0] # Get the first row
for colname, cell in row:
    print('%s = %s' % (colname, cell))

Output:

col = a
is_happy = no

The column_names property gives a sorted list of all column names (without the corresponding column objects):

print(dm.column_names)

Output:

['col', 'is_happy']

Selecting data

Comparing a column to a value

You can select by directly comparing columns to values. This returns a new DataMatrix object with only the selected rows.

dm = DataMatrix(length=10)
dm.col = range(10)
dm_subset = dm.col > 5
print(dm_subset)

Output:

+---+-----+
| # | col |
+---+-----+
| 6 |  6  |
| 7 |  7  |
| 8 |  8  |
| 9 |  9  |
+---+-----+

Selecting by multiple criteria with | (or), & (and), and ^ (xor)

You can select by multiple criteria using the | (or), & (and), and ^ (xor) operators (but not the actual words 'and' and 'or'). Note the parentheses, which are necessary because |, &, and ^ have priority over other operators.

dm_subset = (dm.col < 1) | (dm.col > 8)
print(dm_subset)

Output:

+---+-----+
| # | col |
+---+-----+
| 0 |  0  |
| 9 |  9  |
+---+-----+
dm_subset = (dm.col > 1) & (dm.col < 8)
print(dm_subset)

Output:

+---+-----+
| # | col |
+---+-----+
| 2 |  2  |
| 3 |  3  |
| 4 |  4  |
| 5 |  5  |
| 6 |  6  |
| 7 |  7  |
+---+-----+

Selecting by multiple criteria by comparing to a set {}

If you want to check whether column values are identical to, or different from, a set of test values, you can compare the column to a set object. (This is considerably faster than comparing the column values to each of the test values separately, and then merging the result using & or |.)

dm_subset = dm.col == {1, 3, 5, 7}
print(dm_subset)

Output:

+---+-----+
| # | col |
+---+-----+
| 1 |  1  |
| 3 |  3  |
| 5 |  5  |
| 7 |  7  |
+---+-----+

Selecting with a function or lambda expression

You can also use a function or lambda expression to select column values. The function must take a single argument and its return value determines whether the column value is selected. This is analogous to the classic filter() function.

dm_subset = dm.col == (lambda x: x % 2)
print(dm_subset)

Output:

+---+-----+
| # | col |
+---+-----+
| 1 |  1  |
| 3 |  3  |
| 5 |  5  |
| 7 |  7  |
| 9 |  9  |
+---+-----+

Selecting values that match another column (or sequence)

You can also select by comparing a column to a sequence, in which case a row-by-row comparison is done. This requires that the sequence has the same length as the column, is not a set object (because set objects are treated as described above).

dm = DataMatrix(length=4)
dm.col = 'a', 'b', 'c', 'd'
dm_subset = dm.col == ['a', 'b', 'x', 'y']
print(dm_subset)

Output:

+---+-----+
| # | col |
+---+-----+
| 0 |  a  |
| 1 |  b  |
+---+-----+

When a column contains values of different types, you can also select values by type: (Note: On Python 2, all str values are automatically decoded to unicode, so you'd need to compare the column to unicode to extract str values.)

dm = DataMatrix(length=4)
dm.col = 'a', 1, 'c', 2
dm_subset = dm.col == int
print(dm_subset)

Output:

+---+-----+
| # | col |
+---+-----+
| 1 |  1  |
| 3 |  2  |
+---+-----+

Getting indices for rows that match selection criteria ('where')

You can get the indices for rows that match certain selection criteria by slicing a DataMatrix with a subset of itself. This is similar to the numpy.where() function.

dm = DataMatrix(length=4)
dm.col = 1, 2, 3, 4
print(dm[(dm.col > 1) & (dm.col < 4)])

Output:

[1, 2]

Element-wise column operations

Multiplication, addition, etc.

You can apply basic mathematical operations on all cells in a column simultaneously. Cells with non-numeric values are ignored, except by the + operator, which then results in concatenation.

dm = DataMatrix(length=3)
dm.col = 0, 'a', 20
dm.col2 = dm.col*.5
dm.col3 = dm.col+10
dm.col4 = dm.col-10
dm.col5 = dm.col/50
print(dm)

Output:

+---+-----+------+------+------+------+
| # | col | col2 | col3 | col4 | col5 |
+---+-----+------+------+------+------+
| 0 |  0  | 0.0  |  10  | -10  | 0.0  |
| 1 |  a  |  a   | a10  |  a   |  a   |
| 2 |  20 | 10.0 |  30  |  10  | 0.4  |
+---+-----+------+------+------+------+

Applying a function or lambda expression

The @ operator is only available in Python 3.5 and later.

You can apply a function or lambda expression to all cells in a column simultaneously with the @ operator.

dm = DataMatrix(length=3)
dm.col = 0, 1, 2
dm.col2 = dm.col @ (lambda x: x*2)
print(dm)

Output:

+---+-----+------+
| # | col | col2 |
+---+-----+------+
| 0 |  0  |  0   |
| 1 |  1  |  2   |
| 2 |  2  |  4   |
+---+-----+------+

Reading and writing files

You can read and write files with functions from the datamatrix.io module. The main supported file types are csv and xlsx.

from datamatrix import io

dm = DataMatrix(length=3)
dm.col = 1, 2, 3
# Write to disk
io.writetxt(dm, 'my_datamatrix.csv')
io.writexlsx(dm, 'my_datamatrix.xlsx')
# And read it back from disk!
dm = io.readtxt('my_datamatrix.csv')
dm = io.readxlsx('my_datamatrix.xlsx')

Column types

When you create a DataMatrix, you can indicate a default column type. If you do not specify a default column type, a MixedColumn is used by default.

from datamatrix import DataMatrix, IntColumn
dm = DataMatrix(length=2, default_col_type=IntColumn)
dm.i = 1, 2 # This is an IntColumn

You can also explicitly indicate the column type when creating a new column:

from datamatrix import FloatColumn
dm.f = FloatColumn

MixedColumn (default)

A MixedColumn contains text (unicode in Python 2, str in Python 3), int, float, or None.

Important notes:

  • utf-8 encoding is assumed for byte strings
  • String with numeric values, including NAN and INF, are automatically converted to the most appropriate type
  • The string 'None' is not converted to the type None
  • Trying to assign a non-supported type results in a TypeError
from datamatrix import DataMatrix, NAN, INF
dm = DataMatrix(length=12)
dm.datatype = (
    'int',
    'int (converted)',
    'float',
    'float (converted)',
    'None',
    'str',
    'float',
    'float (converted)',
    'float',
    'float (converted)',
    'float',
    'float (converted)',
)
dm.value = (
    1,
    '1',
    1.2,
    '1.2',
    None,
    'None',
    NAN,
    'nan',
    INF,
    'inf',
    -INF,
    '-inf'
)
print(dm)

Output:

+----+-------------------+-------+
| #  |      datatype     | value |
+----+-------------------+-------+
| 0  |        int        |   1   |
| 1  |  int (converted)  |   1   |
| 2  |       float       |  1.2  |
| 3  | float (converted) |  1.2  |
| 4  |        None       |  None |
| 5  |        str        |  None |
| 6  |       float       |  nan  |
| 7  | float (converted) |  nan  |
| 8  |       float       |  INF  |
| 9  | float (converted) |  INF  |
| 10 |       float       |  -inf |
| 11 | float (converted) |  -inf |
+----+-------------------+-------+

IntColumn (requires numpy)

The IntColumn contains only int values. As of 0.14, the easiest way to create a IntColumn column is to assign int to a new column name.

Important notes:

  • Trying to assign a value that cannot be converted to an int results in a TypeError
  • Float values will be rounded down (i.e. the decimals will be lost)
  • NAN or INF values are not supported because these are float
from datamatrix import DataMatrix
dm = DataMatrix(length=2)
dm.i = int
dm.i = 1, 2
print(dm)

Output:

+---+---+
| # | i |
+---+---+
| 0 | 1 |
| 1 | 2 |
+---+---+

If you insert non-int values, they are automatically converted to int if possible. Decimals are discarded (i.e. values are floored, not rounded):

dm.i = '3', 4.7
print(dm)

Output:

+---+---+
| # | i |
+---+---+
| 0 | 3 |
| 1 | 4 |
+---+---+

If you insert values that cannot converted to int, a TypeError is raised:

try:
    dm.i = 'x'
except TypeError as e:
    print(repr(e))

Output:

TypeError('IntColumn expects integers, not x')

FloatColumn (requires numpy)

The FloatColumn contains float, nan, and inf values. As of 0.14, the easiest way to create a FloatColumn column is to assign float to a new column name.

Important notes:

  • Values that are accepted by a MixedColumn but cannot be converted to a numeric value become NAN. Examples are non-numeric strings or None.
  • Trying to assign a non-supported type results in a TypeError
import numpy as np
from datamatrix import DataMatrix, FloatColumn
dm = DataMatrix(length=3)
dm.f = float
dm.f = 1, np.nan, np.inf
print(dm)

Output:

+---+-----+
| # |  f  |
+---+-----+
| 0 | 1.0 |
| 1 | nan |
| 2 | INF |
+---+-----+

If you insert other values, they are automatically converted if possible.

dm.f = '3.3', 'inf', 'nan'
print(dm)

Output:

+---+-----+
| # |  f  |
+---+-----+
| 0 | 3.3 |
| 1 | INF |
| 2 | nan |
+---+-----+

If you insert values that cannot be converted to float, they become nan.

dm.f = 'x'
print(dm)

Output:

/home/sebastiaan/anaconda3/envs/pydata/lib/python3.10/site-packages/datamatrix/py3compat.py:105: UserWarning: Invalid type for FloatColumn: x
  warnings.warn(safe_str(msg), *args)
â ´ Generating...+---+-----+
| # |  f  |
+---+-----+
| 0 | nan |
| 1 | nan |
| 2 | nan |
+---+-----+
Note: Careful when working with nan data!

You have to take special care when working with nan data. In general, nan is not equal to anything else, not even to itself: nan != nan. You can see this behavior when selecting data from a FloatColumn with nan values in it.

from datamatrix import DataMatrix, FloatColumn
dm = DataMatrix(length=3)
dm.f = FloatColumn
dm.f = 0, np.nan, 1
dm = dm.f == [0, np.nan, 1]
print(dm)

Output:

+---+-----+
| # |  f  |
+---+-----+
| 0 | 0.0 |
| 2 | 1.0 |
+---+-----+

However, for convenience, you can select all nan values by comparing a FloatColumn to a single nan value:

from datamatrix import DataMatrix, FloatColumn
dm = DataMatrix(length=3)
dm.f = FloatColumn
dm.f = 0, np.nan, 1
print('NaN values')
print(dm.f == np.nan)
print('Non-NaN values')
print(dm.f != np.nan)

Output:

NaN values
+---+-----+
| # |  f  |
+---+-----+
| 1 | nan |
+---+-----+
Non-NaN values
+---+-----+
| # |  f  |
+---+-----+
| 0 | 0.0 |
| 2 | 1.0 |
+---+-----+

Working with continuous data (requires numpy)

To work with continous (or time-series) data, datamatrix provides the SeriesColumn class. In a series column, each cell is itself a series of values.

A more elaborate tutorial on working with time series can be found here:

Mixing two- and three-dimensional data

With column-based or tabular data, every cell is defined by two coordinates: the column name, and the row number; that is, column-based data is two dimensional. But for many kinds of data, two dimensions is not enough.

To illustrate this, let's imagine that you want to store the population of cities over a period of three years. You could do this by simply adding a column for every year, population2008, population2009, population2010:

from datamatrix import DataMatrix

# Not very elegant!
dm = DataMatrix(length=2)
dm.city = 'Marseille', 'Lyon'
dm.population2010 = 850726, 484344
dm.population2009 = 850602, 479803
dm.population2008 = 851420, 474946
print(dm)

Output:

+---+-----------+----------------+----------------+----------------+
| # |    city   | population2008 | population2009 | population2010 |
+---+-----------+----------------+----------------+----------------+
| 0 | Marseille |     851420     |     850602     |     850726     |
| 1 |    Lyon   |     474946     |     479803     |     484344     |
+---+-----------+----------------+----------------+----------------+

In this example, this naive approach is still feasible, because there are only three years, so you need only three columns. But imagine that you want to store the year-by-year population over several centuries. You would then end up with hundreds of columns! Not impossible, but not very elegant either.

It would be much more elegant if you could have a single column for the population, and then give this column a third dimension (a depth) so that it can store the population over time. And that's where the SeriesColumn comes in.

from datamatrix import DataMatrix, SeriesColumn

# Pretty elegant, right?
dm = DataMatrix(length=2)
dm.city = 'Marseille', 'Lyon'
dm.population = SeriesColumn(depth=3)
dm.population[0] = 850726, 850602, 851420 # Marseille
dm.population[1] = 484344, 479803, 474946 # Lyon
dm.year = SeriesColumn(depth=3)
dm.year = 2010, 2009, 2008
print(dm)

Output:

+---+-----------+---------------------------+---------------------+
| # |    city   |         population        |         year        |
+---+-----------+---------------------------+---------------------+
| 0 | Marseille | [850726. 850602. 851420.] | [2010. 2009. 2008.] |
| 1 |    Lyon   | [484344. 479803. 474946.] | [2010. 2009. 2008.] |
+---+-----------+---------------------------+---------------------+

Basic properties of series

Series columns have the same properties as regular columns: mean, median, std, sum, min, and max. But where these properties are single values for regular columns, they are one-dimensional numpy arrays for series columns.

print(dm.population.mean)

Output:

[667535.  665202.5 663183. ]

Indexing

Accessing

The first dimension of a series dimension refers to the row. So to get the population of Marseille (row 0) over time, you can do:

print(dm.population[0])

Output:

[850726. 850602. 851420.]

The second dimension refers to the depth. So to get the population of both Marseille and Lyon in 2009 (the full slice :), you can do:

print(dm.population[:, 1])

Output:

col[850602. 479803.]

Assigning

You can assign to a series columns as you would to a 2D numpy array:

dm = DataMatrix(length=2)
dm.s = SeriesColumn(depth=3)
dm.s[0, 0] = 1
dm.s[1:, 1:] = 2
print(dm)

Output:

+---+---------------+
| # |       s       |
+---+---------------+
| 0 | [ 1. nan nan] |
| 1 | [nan  2.  2.] |
+---+---------------+

If you want to set all cells at once, you can directly assign a single value:

dm.s = 10
print(dm)

Output:

+---+---------------+
| # |       s       |
+---+---------------+
| 0 | [10. 10. 10.] |
| 1 | [10. 10. 10.] |
+---+---------------+

If you want to set all rows at once, you can directly assign a sequence with a length that is equal to the depth of the series:

dm.s = 100, 200, 300 
# Equal to: dm.s[:,:] = 100, 200, 300
print(dm)

Output:

+---+------------------+
| # |        s         |
+---+------------------+
| 0 | [100. 200. 300.] |
| 1 | [100. 200. 300.] |
+---+------------------+

If you want to set all columns at once, you can directly assing a sequence with a length that is equal to the length of the datamatrix:

dm.s = 1000, 2000
# Equal to: dm.s[:,:] = 1000, 2000
print(dm)

Output:

+---+---------------------+
| # |          s          |
+---+---------------------+
| 0 | [1000. 1000. 1000.] |
| 1 | [2000. 2000. 2000.] |
+---+---------------------+