Here I outline the difference between wide data format and long data format and why it matters. The wide format is handy for reading, the long format is handy for manipulating.

Going from wide to long is called a gathering and going from long to wide is called spreading. Similarly in Tercen, we have a gathering step. I will explain what are concepts behind this two format.

Here is a sneak preview of converting between the two.

These dynamic images are taken from gadenbuie.

Nearly all data you receive (usually in Excel) has a wide format (i.e. there are many columns), it is called wide because of this fact. It is a very handy format as it allows you to have a compact view of your data. Columns usually have the names of variables like x, y, and z. Below is an example of a wide format. There are four columns, The first column (id) indicates the data entry, in this case, two entries (row 1 and row 2). The last columns represent the variables x,y, and z. The table looks like the one below.

id

x

y

z

1

a

c

e

2

b

d

f

It is a very handy format in that you can look across the row easily. Here are the advantages:

  • compact (represent it with two rows of data)

  • easy to read (get an overview)

  • easy to compare (compare row1 with row2)

The disadvantages are:

  • not precise (unsure whether id is also a variable)

  • hard to add extra concepts (e.g x,y

  • hard to manipulate it

Now let's look at the long version of this table:

id

key

val

1

x

a

2

x

b

1

y

c

2

y

d

1

z

e

2

z

f

This new transformed table contains the same information as the first but with extra information.

The first remark is what are these columns called key and val?

Another name for this could be the name and value. For example, the first row could be read as x (the name/key) and value a (the value).

The second remark is the length of this format as compared to the wide table.

It is clearly much longer and narrower than the previous. The interesting aspect of this table is if you had more columns in the previous table (say l, m,n) then you would still only have three columns (but more rows).

The third remark is how much repetition is going on as you compare rows. The first two rows have nearly the same entries except for the change in val value.

The fourth remark is how x,y,z now have a name themselves (i.e. key). This allows them to be manipulated just like the other entries. It also follows that extra concepts can be added to them, for example, let say x,y represents group 1 and z group 2. This can be added to the table without much thought.

:

id

key

val

group

1

x

a

group 1

2

x

b

group 1

1

y

c

group 1

2

y

d

group 1

1

z

e

group 2

2

z

f

group 2

In Tercen, long formats allow the user to manipulate the data much easier, for example, the user can easily calculate the mean per group.

In Tercen you can perform gathering two different ways:

  • Using the Gather step

  • During the import of a delimited file

The Gather step allows you to select the columns you wish to gather.

You can also use the Advanced button during the import of delimited files to convert from wide to long during the import. This allows you to select what columns you wish to gather.

Did this answer your question?