datamatrix.operations
A set of common operations that can be applied to columns and DataMatrix objects. This module is typically imported as ops for brevity:
from datamatrix import operations as ops
- function auto_type(dm)
- function bin_split(col, bins)
- function fullfactorial(dm, ignore=u'')
- function group(dm, by)
- function keep_only(dm, *cols)
- function pivot_table(dm, values, index, columns, *args, **kwargs)
- function random_sample(obj, k)
- function replace(col, mappings={})
- function shuffle(obj)
- function shuffle_horiz(*obj)
- function sort(obj, by=None)
- function split(col, *values)
- function stack(*dms)
- function weight(col)
- function z(col)
function auto_type(dm)
Requires fastnumbers
Converts all columns of type MixedColumn to IntColumn if all values are integer numbers, or FloatColumn if all values are non-integer numbers.
Example:
from datamatrix import DataMatrix, operations as ops
dm = DataMatrix(length=5)
dm.A = 'a'
dm.B = 1
dm.C = 1.1
dm_new = ops.auto_type(dm)
print('dm_new.A: %s' % type(dm_new.A))
print('dm_new.B: %s' % type(dm_new.B))
print('dm_new.C: %s' % type(dm_new.C))
Output:
dm_new.A: <class 'datamatrix._datamatrix._mixedcolumn.MixedColumn'>
dm_new.B: <class 'datamatrix._datamatrix._numericcolumn.IntColumn'>
dm_new.C: <class 'datamatrix._datamatrix._numericcolumn.FloatColumn'>
Arguments:
dm
-- No description
- Type: DataMatrix
Returns:
No description
- Type: DataMatrix
function bin_split(col, bins)
Splits a DataMatrix into bins; that is, the DataMatrix is first sorted by a column, and then split into equal-size (or roughly equal-size) bins.
Example:
from datamatrix import DataMatrix, operations as ops
dm = DataMatrix(length=5)
dm.A = 1, 0, 3, 2, 4
dm.B = 'a', 'b', 'c', 'd', 'e'
for bin, dm in enumerate(ops.bin_split(dm.A, bins=3)):
    print('bin %d' % bin)
    print(dm)
Output:
bin 0
+---+---+---+
| # | A | B |
+---+---+---+
| 1 | 0 | b |
+---+---+---+
bin 1
+---+---+---+
| # | A | B |
+---+---+---+
| 0 | 1 | a |
| 3 | 2 | d |
+---+---+---+
bin 2
+---+---+---+
| # | A | B |
+---+---+---+
| 2 | 3 | c |
| 4 | 4 | e |
+---+---+---+
Arguments:
col
-- The column to split by.
- Type: BaseColumn
bins
-- The number of bins.
- Type: int
Returns:
A generator that iterates over the bins.
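The bin sizes in the output above (1, 2, and 2 rows) can be reproduced with integer bin boundaries. The following is a minimal pure-Python sketch of that idea, not the library's implementation; bin_split_sketch and the (A, B) tuples are hypothetical stand-ins for a DataMatrix.

```python
def bin_split_sketch(rows, key, bins):
    """Sort rows by key, then yield `bins` slices of roughly equal size.

    Hypothetical helper for illustration only; not part of datamatrix.
    """
    ordered = sorted(rows, key=key)
    n = len(ordered)
    for i in range(bins):
        # Integer boundaries: for n=5 and bins=3 these are 0, 1, 3, 5
        start = i * n // bins
        end = (i + 1) * n // bins
        yield ordered[start:end]


# Same values as columns A and B in the example above
rows = [(1, 'a'), (0, 'b'), (3, 'c'), (2, 'd'), (4, 'e')]
result = list(bin_split_sketch(rows, key=lambda row: row[0], bins=3))
for i, chunk in enumerate(result):
    print('bin %d: %s' % (i, chunk))
```

When the number of rows is not a multiple of the number of bins, the slices differ in length by at most one row.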
function fullfactorial(dm, ignore=u'')
Requires numpy
Creates a new DataMatrix that uses a specified DataMatrix as the base of a full-factorial design. That is, each value of every column is combined with each value from every other column.
Example:
from datamatrix import DataMatrix, operations as ops
dm = DataMatrix(length=2)
dm.A = 'x', 'y'
dm.B = 3, 4
dm = ops.fullfactorial(dm)
print(dm)
Output:
+---+---+---+
| # | A | B |
+---+---+---+
| 0 | x | 3 |
| 1 | y | 3 |
| 2 | x | 4 |
| 3 | y | 4 |
+---+---+---+
Arguments:
dm
-- The source DataMatrix.
- Type: DataMatrix
Keywords:
ignore
-- A value that should be ignored.
- Default: ''
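Conceptually, a full-factorial design is the Cartesian product of the values found in each column. The sketch below illustrates this with itertools.product on plain lists. It is not the library's implementation: fullfactorial_sketch is a hypothetical helper, and the way it treats ignore (dropping matching cells before combining) is an assumption.

```python
from itertools import product


def fullfactorial_sketch(columns, ignore=''):
    """Combine every value of each column with every value of the others.

    `columns` maps column names to lists of values (a stand-in for a
    DataMatrix). Hypothetical helper; not part of datamatrix.
    """
    names = list(columns)
    # Assumption: cells equal to `ignore` do not count as levels
    levels = [[v for v in columns[name] if v != ignore] for name in names]
    # product() varies its last argument fastest, so reverse the columns
    # to make the first column vary fastest, as in the output above
    combos = product(*reversed(levels))
    return [dict(zip(names, reversed(combo))) for combo in combos]


design = fullfactorial_sketch({'A': ['x', 'y'], 'B': [3, 4]})
for row in design:
    print(row)
```

Under the stated assumption, columns with different numbers of levels can be padded with the ignore value: three levels in one column and two in another give six rows.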
function group(dm, by)
Requires numpy
Groups the DataMatrix by unique values in a set of grouping columns. Grouped columns are stored as SeriesColumns. The columns that are grouped should contain numeric values. The order in which groups appear in the grouped DataMatrix is unpredictable.
Example:
from datamatrix import DataMatrix, operations as ops
dm = DataMatrix(length=4)
dm.A = 'x', 'x', 'y', 'y'
dm.B = 0, 1, 2, 3
print('Original:')
print(dm)
dm = ops.group(dm, by=dm.A)
print('Grouped by A:')
print(dm)
Output:
Original:
+---+---+---+
| # | A | B |
+---+---+---+
| 0 | x | 0 |
| 1 | x | 1 |
| 2 | y | 2 |
| 3 | y | 3 |
+---+---+---+
Grouped by A:
+---+---+---------+
| # | A | B |
+---+---+---------+
| 0 | x | [0. 1.] |
| 1 | y | [2. 3.] |
+---+---+---------+
Arguments:
dm
-- The DataMatrix to group.
- Type: DataMatrix
by
-- A column or list of columns to group by.
- Type: BaseColumn, list
Returns:
A grouped DataMatrix.
- Type: DataMatrix
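The grouping logic can be pictured as collecting, for each unique value of the grouping column, the values of the other columns into one series. This is a minimal sketch on plain lists, assuming a single grouping column; it does not use or reproduce the library's SeriesColumn.

```python
# Same values as columns A and B in the example above
A = ['x', 'x', 'y', 'y']
B = [0, 1, 2, 3]

grouped = {}
for key, value in zip(A, B):
    # Grouped values are stored as floats, mirroring the [0. 1.] output
    grouped.setdefault(key, []).append(float(value))

for key, series in grouped.items():
    print(key, series)
```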
function keep_only(dm, *cols)
Removes all columns from the DataMatrix, except those listed in cols.
Version note: As of 0.11.0, the preferred way to select a subset of columns is using the dm = dm[('col1', 'col2')] notation.
Example:
from datamatrix import DataMatrix, operations as ops
dm = DataMatrix(length=5)
dm.A = 'a', 'b', 'c', 'd', 'e'
dm.B = range(5)
dm.C = range(5, 10)
dm_new = ops.keep_only(dm, dm.A, dm.C)
print(dm_new)
Output:
+---+---+---+
| # | A | C |
+---+---+---+
| 0 | a | 5 |
| 1 | b | 6 |
| 2 | c | 7 |
| 3 | d | 8 |
| 4 | e | 9 |
+---+---+---+
Arguments:
dm
-- No description
- Type: DataMatrix
Argument list:
*cols
: A list of column names, or column objects.
function pivot_table(dm, values, index, columns, *args, **kwargs)
Requires pandas
Version note: New in 0.14.1
Creates a pivot table where rows correspond to levels of index, columns correspond to levels of columns, and cells contain aggregate values of values.
A typical use for a pivot table is to create a summary report for a data set. For example, in an experiment where reaction times of human participants were measured on a large number of trials under different conditions, each row might correspond to one participant, each column to an experimental condition (or a combination of experimental conditions), and cells might contain mean reaction times.
This function is a wrapper around pandas.pivot_table(). For an overview of possible *args and **kwargs, see the pandas documentation for pivot_table().
Example:
from datamatrix import operations as ops, io
dm = io.readtxt('data/fratescu-replication-data-exp1.csv')
pm = ops.pivot_table(dm, values=dm.RT_search, index=dm.subject_nr,
columns=dm.load)
print(pm)
Output:
+----+--------------------+--------------------+
| # | 1 | 2 |
+----+--------------------+--------------------+
| 0 | 691.393451812936 | 678.3091036076119 |
| 1 | 1037.4137452306413 | 1076.5579254730912 |
| 2 | 725.8907459323649 | 740.7180629199368 |
| 3 | 690.0324213757542 | 663.2912040537004 |
| 4 | 1061.9616479996344 | 1066.694913085751 |
| 5 | 878.9107412950773 | 868.7606042917906 |
| 6 | 772.3190416083047 | 751.7079807753719 |
| 7 | 640.5894986370438 | 620.1758912269404 |
| 8 | 591.1702219508884 | 576.4774491644316 |
| 9 | 610.0829479542426 | 582.0857663440086 |
| 10 | 912.6923951234676 | 885.8144986324572 |
| 11 | 776.5285874867564 | 744.9990142569052 |
| 12 | 811.9071031332232 | 808.8067775165715 |
| 13 | 763.8125378568926 | 756.239461402817 |
| 14 | 629.1304692714401 | 614.8002285032511 |
| 15 | 1138.8041812832648 | 1099.0619141121608 |
| 16 | 669.6717745408761 | 665.5764135306341 |
| 17 | 667.380042786298 | 654.8964957059492 |
| 18 | 696.0044456339372 | 682.9299482924577 |
| 19 | 703.5121217687149 | 688.2862053908701 |
+----+--------------------+--------------------+
(+ 36 rows not shown)
Arguments:
dm
-- The source DataMatrix.
- Type: DataMatrix
values
-- A column or list of columns to aggregate.
- Type: BaseColumn, str, list
index
-- A column or list of columns to separate rows by.
- Type: BaseColumn, str, list
columns
-- A column or list of columns to separate columns by.
- Type: BaseColumn, str, list
Argument list:
*args
: No description.
Keyword dict:
**kwargs
: No description.
Returns:
No description
- Type: DataMatrix
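To make the aggregation concrete, the following sketch builds a small pivot table by hand: one row per participant, one column per condition, and the mean as the cell value (the mean is the default aggregation of pandas.pivot_table()). The trial values are made up for illustration and the sketch uses only the standard library, not datamatrix or pandas.

```python
from collections import defaultdict
from statistics import mean

# (subject_nr, load, RT_search); made-up values for illustration
trials = [
    (1, 1, 700), (1, 1, 720), (1, 2, 690),
    (2, 1, 1000), (2, 2, 1050), (2, 2, 1070),
]

# Collect all reaction times per (row level, column level) cell
cells = defaultdict(list)
for subject, load, rt in trials:
    cells[(subject, load)].append(rt)

# Aggregate each cell: rows are subjects, columns are load levels
pivot = {}
for (subject, load), rts in cells.items():
    pivot.setdefault(subject, {})[load] = mean(rts)

for subject, row in pivot.items():
    print(subject, row)
```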
function random_sample(obj, k)
Version note: New in 0.11.0
Takes a random sample of k rows from a DataMatrix or column. The order of the rows in the returned DataMatrix is random.
Example:
from datamatrix import DataMatrix, operations as ops
dm = DataMatrix(length=5)
dm.A = 'a', 'b', 'c', 'd', 'e'
dm = ops.random_sample(dm, k=3)
print(dm)
Arguments:
obj
-- No description
- Type: DataMatrix, BaseColumn
k
-- No description
- Type: int
Returns:
A random sample from a DataMatrix or column.
- Type: DataMatrix, BaseColumn
function replace(col, mappings={})
Replaces values in a column by other values.
Example:
from datamatrix import DataMatrix, operations as ops
dm = DataMatrix(length=3)
dm.old = 0, 1, 2
dm.new = ops.replace(dm.old, {0 : 'a', 2 : 'c'})
print(dm)
In the printed DataMatrix, old contains 0, 1, 2 and new contains a, 1, c: the values 0 and 2 are replaced, while 1, which has no entry in mappings, is left unchanged.
Arguments:
col
-- The column in which to replace values.
- Type: BaseColumn
Keywords:
mappings
-- A dict where old values are keys and new values are values.
- Type: dict
- Default: {}
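The mapping semantics can be sketched on a plain list: each value that appears as a key in mappings is swapped for the corresponding value, and everything else passes through. This is an illustration of the idea, assuming unmapped values are kept as is; it is not the library's implementation.

```python
old = [0, 1, 2]
mappings = {0: 'a', 2: 'c'}

# dict.get(v, v) falls back to the original value when v is not a key
new = [mappings.get(v, v) for v in old]
print(new)
```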
function shuffle(obj)
Shuffles a DataMatrix or a column. If a DataMatrix is shuffled, the order of the rows is shuffled, but values that were in the same row will stay in the same row.
Example:
from datamatrix import DataMatrix, operations as ops
dm = DataMatrix(length=5)
dm.A = 'a', 'b', 'c', 'd', 'e'
dm.B = ops.shuffle(dm.A)
print(dm)
Output:
+---+---+---+
| # | A | B |
+---+---+---+
| 0 | a | d |
| 1 | b | b |
| 2 | c | e |
| 3 | d | c |
| 4 | e | a |
+---+---+---+
Arguments:
obj
-- No description
- Type: DataMatrix, BaseColumn
Returns:
The shuffled DataMatrix or column.
- Type: DataMatrix, BaseColumn
function shuffle_horiz(*obj)
Shuffles a DataMatrix, or several columns from a DataMatrix, horizontally. That is, the values are shuffled between columns from the same row.
Example:
from datamatrix import DataMatrix, operations as ops
dm = DataMatrix(length=5)
dm.A = 'a', 'b', 'c', 'd', 'e'
dm.B = range(5)
dm = ops.shuffle_horiz(dm.A, dm.B)
print(dm)
Output:
+---+---+---+
| # | A | B |
+---+---+---+
| 0 | a | 0 |
| 1 | 1 | b |
| 2 | 2 | c |
| 3 | 3 | d |
| 4 | 4 | e |
+---+---+---+
Argument list:
*obj
: A list of BaseColumns, or a single DataMatrix.
Returns:
The shuffled DataMatrix.
- Type: DataMatrix
function sort(obj, by=None)
Sorts a column or DataMatrix. In the case of a DataMatrix, a column must be specified to determine the sort order. In the case of a column, the by keyword needs to be specified only if the column should be sorted by another column.
The sort order is as follows:
- -INF
- int and float values in increasing order
- INF
- str values in alphabetical order, where uppercase letters come first
- None
- NAN
You can also sort columns (but not DataMatrix objects) using the built-in sorted() function. However, when sorting mixed types, this may lead to Exceptions or (in the case of NAN values) unpredictable results.
Example:
from datamatrix import DataMatrix, operations as ops
dm = DataMatrix(length=3)
dm.A = 2, 0, 1
dm.B = 'a', 'b', 'c'
dm = ops.sort(dm, by=dm.A)
print(dm)
Output:
+---+---+---+
| # | A | B |
+---+---+---+
| 1 | 0 | b |
| 2 | 1 | c |
| 0 | 2 | a |
+---+---+---+
Arguments:
obj
-- No description
- Type: DataMatrix, BaseColumn
Keywords:
by
-- The sort key, that is, the column that is used for sorting the DataMatrix, or the other column.
- Type: BaseColumn
- Default: None
Returns:
The sorted DataMatrix, or the sorted column.
- Type: DataMatrix, BaseColumn
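The sort order described for this function (-INF, numbers in increasing order, INF, strings with uppercase first, None, NAN) can be emulated for plain Python lists with a key function that ranks each value by type before comparing within the type. The helper sort_key below is hypothetical and only illustrates why the built-in sorted() needs such a key for mixed types; it is not how the library sorts.

```python
import math


def sort_key(value):
    """Rank values by category, then by value within the category.

    Hypothetical helper for illustration only; not part of datamatrix.
    """
    if isinstance(value, float) and math.isnan(value):
        return (3, 0)      # NAN goes last
    if value is None:
        return (2, 0)      # None comes just before NAN
    if isinstance(value, str):
        return (1, value)  # code-point order puts uppercase first
    return (0, value)      # int and float, including -inf and inf


values = ['b', float('nan'), 2, None, 'A', float('inf'), -1.5,
          float('-inf'), 'a']
ordered = sorted(values, key=sort_key)
print(ordered)
```

Because the first element of each key differs between categories, strings are never compared with numbers, which avoids the exceptions mentioned above.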
function split(col, *values)
Splits a DataMatrix by unique values in a column.
Version note: As of 0.12.0, split() accepts multiple columns as shown below.
Example:
from datamatrix import DataMatrix, operations as ops
dm = DataMatrix(length=4)
dm.A = 0, 0, 1, 1
dm.B = 'a', 'b', 'c', 'd'
# If no values are specified, a (value, DataMatrix) iterator is
# returned.
print('Splitting by a single column')
for A, sdm in ops.split(dm.A):
    print('sdm.A = %s' % A)
    print(sdm)
# You can also split by multiple columns at the same time.
print('Splitting by two columns')
for A, B, sdm in ops.split(dm.A, dm.B):
    print('sdm.A = %s, sdm.B = %s' % (A, B))
# If values are specified, an iterator over DataMatrix objects is
# returned.
print('Splitting by values')
dm_a, dm_c = ops.split(dm.B, 'a', 'c')
print('dm.B == "a"')
print(dm_a)
print('dm.B == "c"')
print(dm_c)
Output:
Splitting by a single column
sdm.A = 0
+---+---+---+
| # | A | B |
+---+---+---+
| 0 | 0 | a |
| 1 | 0 | b |
+---+---+---+
sdm.A = 1
+---+---+---+
| # | A | B |
+---+---+---+
| 2 | 1 | c |
| 3 | 1 | d |
+---+---+---+
Splitting by two columns
sdm.A = 0, sdm.B = a
sdm.A = 0, sdm.B = b
sdm.A = 1, sdm.B = c
sdm.A = 1, sdm.B = d
Splitting by values
dm.B == "a"
+---+---+---+
| # | A | B |
+---+---+---+
| 0 | 0 | a |
+---+---+---+
dm.B == "c"
+---+---+---+
| # | A | B |
+---+---+---+
| 2 | 1 | c |
+---+---+---+
Arguments:
col
-- The column to split by.
- Type: BaseColumn
Argument list:
*values
: Splits the DataMatrix based on these values. If this is provided, an iterator over DataMatrix objects is returned, rather than an iterator over (value, DataMatrix) tuples.
Returns:
An iterator over (value, DataMatrix) tuples if no values are provided; an iterator over DataMatrix objects if values are provided.
- Type: Iterator
function stack(*dms)
Stacks multiple DataMatrix objects such that the resulting DataMatrix has a length that is equal to the sum of all the stacked DataMatrix objects. Phrased differently, this function vertically concatenates DataMatrix objects.
See also stack_multiprocess() for stacking DataMatrix objects that are returned by functions running in different processes.
Stacking two DataMatrix objects can also be done with the << operator. However, when stacking more than two DataMatrix objects, using stack() is much faster than iteratively stacking with <<.
Version note: New in 1.0.0
Example:
from datamatrix import DataMatrix, operations as ops
dm1 = DataMatrix(length=2)
dm1.col = 'A'
dm2 = DataMatrix(length=2)
dm2.col = 'B'
dm3 = DataMatrix(length=2)
dm3.col = 'C'
dm = ops.stack(dm1, dm2, dm3)
print(dm)
Output:
+---+-----+
| # | col |
+---+-----+
| 0 | A |
| 1 | A |
| 2 | B |
| 3 | B |
| 4 | C |
| 5 | C |
+---+-----+
Argument list:
*dms
: A list of DataMatrix objects.
Returns:
No description
- Type: DataMatrix
function weight(col)
Weights a DataMatrix by a column. That is, each row from a DataMatrix is repeated as many times as the value in the weighting column.
Example:
from datamatrix import DataMatrix, operations as ops
dm = DataMatrix(length=3)
dm.A = 1, 2, 0
dm.B = 'x', 'y', 'z'
print('Original:')
print(dm)
dm = ops.weight(dm.A)
print('Weighted by A:')
print(dm)
Output:
Original:
+---+---+---+
| # | A | B |
+---+---+---+
| 0 | 1 | x |
| 1 | 2 | y |
| 2 | 0 | z |
+---+---+---+
Weighted by A:
+---+---+---+
| # | A | B |
+---+---+---+
| 0 | 1 | x |
| 1 | 2 | y |
| 2 | 2 | y |
+---+---+---+
Arguments:
col
-- The column to weight by.
- Type: BaseColumn
Returns:
No description
- Type: DataMatrix
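Weighting amounts to repeating each row as many times as its weight, so a weight of 0 drops the row. A minimal sketch on a list of dicts (a stand-in for a DataMatrix, not the library's implementation):

```python
# Same values as columns A and B in the example above
rows = [
    {'A': 1, 'B': 'x'},
    {'A': 2, 'B': 'y'},
    {'A': 0, 'B': 'z'},
]

# Repeat each row row['A'] times; range(0) yields nothing, so z is dropped
weighted = [row for row in rows for _ in range(row['A'])]
for row in weighted:
    print(row)
```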
function z(col)
Transforms a column into z scores such that the mean of all values is 0 and the standard deviation is 1.
Version note: As of 0.13.2, z() returns a FloatColumn when a regular column is given. For non-numeric values, the z score is NAN. If the standard deviation is 0, z scores are also NAN.
Version note: As of 0.15.3, z() also accepts series columns, in which case the series is z-transformed such that the grand mean of all samples is 0, and the grand standard deviation of all samples is 1.
Example:
from datamatrix import DataMatrix, operations as ops
dm = DataMatrix(length=5)
dm.col = range(5)
dm.z = ops.z(dm.col)
print(dm)
Output:
+---+-----+---------------------+
| # | col | z |
+---+-----+---------------------+
| 0 | 0 | -1.2649110640673518 |
| 1 | 1 | -0.6324555320336759 |
| 2 | 2 | 0.0 |
| 3 | 3 | 0.6324555320336759 |
| 4 | 4 | 1.2649110640673518 |
+---+-----+---------------------+
Arguments:
col
-- The column to transform.
- Type: BaseColumn
Returns:
No description
- Type: BaseColumn
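The values in the example output are consistent with a z-transform that uses the sample standard deviation (dividing by n - 1): for 0 to 4 the mean is 2 and the standard deviation is the square root of 2.5. The sketch below recomputes them with the standard library. z_sketch is a hypothetical helper, it handles only numeric lists, and it is not the library's implementation.

```python
import math
from statistics import mean, stdev


def z_sketch(values):
    """Return z scores based on the sample standard deviation.

    Hypothetical helper for illustration only; not part of datamatrix.
    """
    m = mean(values)
    sd = stdev(values)  # sample standard deviation (n - 1)
    if sd == 0:
        # Mirrors the documented behavior: zero deviation gives NAN
        return [math.nan] * len(values)
    return [(v - m) / sd for v in values]


scores = z_sketch([0, 1, 2, 3, 4])
print(scores)
```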