Commit af6c443

Merge pull request #19 from Cysharp/utf8
UTF8 String serialization
2 parents 5c28dee + c494495 commit af6c443

20 files changed, with 490 additions and 82 deletions

README.md

Lines changed: 26 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -336,9 +336,9 @@ Serialize has three overloads.
336336

337337
```csharp
338338
// A non-generic API is also available; those overloads take a Type as the first argument and the value as object?
339-
byte[] Serialize<T>(in T? value)
340-
void Serialize<T, TBufferWriter>(in TBufferWriter bufferWriter, in T? value)
341-
async ValueTask SerializeAsync<T>(Stream stream, T? value, CancellationToken cancellationToken = default)
339+
byte[] Serialize<T>(in T? value, MemoryPackSerializeOptions? options = default)
340+
void Serialize<T, TBufferWriter>(in TBufferWriter bufferWriter, in T? value, MemoryPackSerializeOptions? options = default)
341+
async ValueTask SerializeAsync<T>(Stream stream, T? value, MemoryPackSerializeOptions? options = default, CancellationToken cancellationToken = default)
342342
```
343343

344344
For best performance, the recommended approach is to use `BufferWriter`, which serializes directly into the buffer. It can be applied to `PipeWriter` in `System.IO.Pipelines`, `BodyWriter` in ASP.NET Core, etc.
@@ -349,6 +349,16 @@ Note that `SerializeAsync` for `Stream` is asynchronous only for Flush; it seria
349349

350350
If you want to do complete streaming write, see [Streaming Serialization](#streaming-serialization) section.
351351

352+
### MemoryPackSerializeOptions
353+
354+
`MemoryPackSerializeOptions` configures whether strings are serialized as UTF16 or UTF8. If `null` is passed, `MemoryPackSerializeOptions.Default` is used, which is the same as `MemoryPackSerializeOptions.Utf8`; in other words, strings are serialized as UTF8. To serialize as UTF16, use `MemoryPackSerializeOptions.Utf16`.
355+
356+
Since C#'s internal string representation is UTF16, UTF16 performs better. However, the payload tends to be larger: an ASCII character takes one byte in UTF8 but two bytes in UTF16. Because this difference in payload size is so large, UTF8 is the default.
357+
358+
If the data is mostly non-ASCII (e.g. Japanese, where a character can take 3 bytes or more in UTF8, making the UTF8 payload larger), or if you compress the payload separately anyway, UTF16 may give better results.
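The size trade-off can be checked with the BCL encoders alone. The sketch below is only an illustration: `ByteCounts` is a hypothetical helper, it does not call MemoryPack, and it counts the raw string bytes without any length header.

```csharp
using System;
using System.Text;

// Hypothetical helper: raw byte counts of the string data only (no MemoryPack header).
static (int Utf8, int Utf16) ByteCounts(string s) =>
    (Encoding.UTF8.GetByteCount(s), Encoding.Unicode.GetByteCount(s));

var ascii = ByteCounts("abc");       // 1 byte per char in UTF8, 2 in UTF16
var japanese = ByteCounts("あいう");  // 3 bytes per char in UTF8, 2 in UTF16

Console.WriteLine($"ASCII:    UTF8={ascii.Utf8}, UTF16={ascii.Utf16}");       // UTF8=3, UTF16=6
Console.WriteLine($"Japanese: UTF8={japanese.Utf8}, UTF16={japanese.Utf16}"); // UTF8=9, UTF16=6
```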
359+
360+
Whether UTF8 or UTF16 is selected during serialization, it is not necessary to specify it during deserialization. It will be automatically detected and deserialized normally.
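As a minimal sketch of the behavior described above, the following round trip uses only the overloads that appear in this commit (`Serialize(value, options)` and `Deserialize<T>(byte[])`). It assumes the MemoryPack package at this commit is referenced; the variable names are illustrative.

```csharp
using System;
using MemoryPack; // assumption: the MemoryPack package from this commit is referenced

var text = "abc";

// Omitting the options (or passing null) uses MemoryPackSerializeOptions.Default, i.e. UTF8.
byte[] utf8Payload = MemoryPackSerializer.Serialize(text);
byte[] utf16Payload = MemoryPackSerializer.Serialize(text, MemoryPackSerializeOptions.Utf16);

// Deserialization takes no encoding option; the form is detected from the payload.
string? fromUtf8 = MemoryPackSerializer.Deserialize<string>(utf8Payload);
string? fromUtf16 = MemoryPackSerializer.Deserialize<string>(utf16Payload);

Console.WriteLine($"{fromUtf8} / {fromUtf16}");
Console.WriteLine($"UTF8 payload: {utf8Payload.Length} bytes, UTF16 payload: {utf16Payload.Length} bytes");
```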
361+
352362
Deserialize API
353363
---
354364
Deserialize has `ReadOnlySpan<byte>` and `ReadOnlySequence<byte>`, `Stream` overload and `ref` support.
@@ -473,10 +483,10 @@ Payload size depends on the target value; unlike JSON, there are no keys and it
473483

474484
For those with varint encoding, such as MessagePack and Protobuf, MemoryPack tends to be larger if ints are used a lot (in MemoryPack, ints are always 4 bytes due to fixed size encoding, while MsgPack is 1~5 bytes).
475485

476-
Also, strings are usually UTF8 for other formats, but MemoryPack is UTF16 fixed length (2 bytes), so MemoryPack is larger if the string occupies ASCII. Conversely, MemoryPack may be smaller if the string contains many UTF8 characters of 3 bytes or more, such as Japanese.
477-
478486
float and double are 4 bytes and 8 bytes in MemoryPack, but 5 bytes and 9 bytes in MsgPack. So MemoryPack is smaller, for example, for Vector3 (float, float, float) arrays.
479487

488+
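Strings are UTF8 by default, as in other serializers. If the UTF16 option is chosen, the size characteristics differ: ASCII-heavy strings become larger, while strings with many characters that take 3 bytes or more in UTF8, such as Japanese, may become smaller.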
489+
480490
In any case, if the payload size is large, compression should be considered. LZ4, ZStandard and Brotli are recommended. An efficient way to combine compression and serialization will be presented at a later date.
481491

482492
Packages
@@ -548,15 +558,16 @@ If you request it, there is a possibility to make a detuned Unity version. Pleas
548558

549559
Binary wire format specification
550560
---
551-
The type of `T` defined in `Serialize<T>` and `Deserialize<T>` is called C# schema. MemoryPack format is not self described format. Deserialize requires the corresponding C# schema. Four types exist as internal representations of binaries, but types cannot be determined without a C# schema.
561+
The type of `T` defined in `Serialize<T>` and `Deserialize<T>` is called the C# schema. The MemoryPack format is not self-describing; deserialization requires the corresponding C# schema. Five types exist as internal representations of the binary, but they cannot be determined without a C# schema.
552562

553563
There are no endian specifications. It is not possible to convert on machines with different endianness. However modern computers are usually little-endian.
554564

555-
There are four value types of format.
565+
The format has five value types.
556566

557567
* Unmanaged struct
558568
* Object
559569
* Collection
570+
* String
560571
* Union
561572

562573
### Unmanaged struct
@@ -574,7 +585,14 @@ Object has 1byte unsigned byte as member count in header. Member count allows `0
574585

575586
`[int length, values...]`
576587

577-
Collection has 4byte signed interger as data count in header, `-1` represents `null`. Values store memorypack value for the number of length. String is collection(serialize as `ReadOnlySpan<char>`, in other words, UTF16).
588+
A collection has a 4-byte signed integer as the data count in its header; `-1` represents `null`. The values store `length` MemoryPack values.
589+
590+
### String
591+
592+
`(int utf16-length, utf16-value)`
593+
`(int ~utf8-length, int utf16-length, utf8-value)`
594+
595+
A string has two forms, UTF16 and UTF8. If the first 4-byte signed integer is `-1`, it represents `null`; `0` represents an empty string. The UTF16 form is the same as a collection (serialized as `ReadOnlySpan<char>`; the byte count of utf16-value is utf16-length * 2). If the first signed integer is <= `-2`, the value is encoded as UTF8. utf8-length is stored as its bitwise complement, so apply `~` to the stored value to retrieve the length. The next signed integer is utf16-length; it may be `-1`, which represents an unknown length. utf8-value stores utf8-length bytes.
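The header rules above can be sketched as a small reader. This is an illustration under stated assumptions, not MemoryPack's implementation: `ReadString` is a hypothetical helper, it assumes a little-endian payload (the format itself does not specify endianness), and it ignores the utf16-length hint in the UTF8 form.

```csharp
using System;
using System.Buffers.Binary;
using System.Text;

// Hypothetical helper for illustration only; it is not part of MemoryPack.
// Assumes a little-endian payload.
static string? ReadString(ReadOnlySpan<byte> data)
{
    int header = BinaryPrimitives.ReadInt32LittleEndian(data);
    if (header == -1) return null;         // -1: null
    if (header == 0) return string.Empty;  //  0: empty string

    if (header > 0)
    {
        // UTF16 form: (int utf16-length, utf16-value); the value is utf16-length * 2 bytes.
        return Encoding.Unicode.GetString(data.Slice(4, checked(header * 2)));
    }

    // header <= -2, UTF8 form: (int ~utf8-length, int utf16-length, utf8-value).
    int utf8Length = ~header;
    // The utf16-length at offset 4 may be -1 (unknown); this sketch does not use it.
    return Encoding.UTF8.GetString(data.Slice(8, utf8Length));
}

// Build a UTF8-form payload for "abc" by hand and decode it.
byte[] utf8Body = Encoding.UTF8.GetBytes("abc");
byte[] handBuilt = new byte[8 + utf8Body.Length];
BinaryPrimitives.WriteInt32LittleEndian(handBuilt.AsSpan(0, 4), ~utf8Body.Length); // ~3 == -4
BinaryPrimitives.WriteInt32LittleEndian(handBuilt.AsSpan(4, 4), 3);                // utf16-length
utf8Body.CopyTo(handBuilt, 8);

Console.WriteLine(ReadString(handBuilt)); // abc
```

Storing the UTF8 length as a complement keeps the first integer negative, so a reader can tell the two forms apart from the header alone, which is why no option is needed at deserialization.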
578596

579597
### Union
580598

sandbox/Benchmark/Benchmarks/DeserializeTest.cs

Lines changed: 6 additions & 6 deletions
@@ -18,10 +18,10 @@
1818

1919
namespace Benchmark.Benchmarks;
2020

21-
[GenericTypeArguments(typeof(int))]
22-
[GenericTypeArguments(typeof(Vector3[]))]
23-
[GenericTypeArguments(typeof(JsonResponseModel))]
24-
[GenericTypeArguments(typeof(NeuralNetworkLayerModel))]
21+
//[GenericTypeArguments(typeof(int))]
22+
//[GenericTypeArguments(typeof(Vector3[]))]
23+
//[GenericTypeArguments(typeof(JsonResponseModel))]
24+
//[GenericTypeArguments(typeof(NeuralNetworkLayerModel))]
2525
public class DeserializeTest<T> : SerializerTestBase<T>
2626
{
2727
//SerializerSessionPool pool;
@@ -51,13 +51,13 @@ public DeserializeTest()
5151
payloadJson = JsonSerializer.SerializeToUtf8Bytes(value);
5252
}
5353

54-
[Benchmark(Baseline = true)]
54+
[Benchmark]
5555
public T MessagePackDeserialize()
5656
{
5757
return MessagePackSerializer.Deserialize<T>(payloadMessagePack);
5858
}
5959

60-
[Benchmark]
60+
[Benchmark(Baseline = true)]
6161
public T? MemoryPackDeserialize()
6262
{
6363
return MemoryPackSerializer.Deserialize<T>(payloadMemoryPack);

sandbox/Benchmark/Benchmarks/SerializeTest.cs

Lines changed: 18 additions & 5 deletions
@@ -70,16 +70,22 @@ public SerializeTest()
7070
jsonWriter = new Utf8JsonWriter(writer);
7171
}
7272

73-
[Benchmark(Baseline = true), BenchmarkCategory(Categories.Bytes)]
73+
[Benchmark, BenchmarkCategory(Categories.Bytes)]
7474
public byte[] MessagePackSerialize()
7575
{
7676
return MessagePackSerializer.Serialize(value);
7777
}
7878

79-
[Benchmark, BenchmarkCategory(Categories.Bytes)]
79+
[Benchmark(Baseline = true), BenchmarkCategory(Categories.Bytes)]
8080
public byte[] MemoryPackSerialize()
8181
{
82-
return MemoryPackSerializer.Serialize(value);
82+
return MemoryPackSerializer.Serialize(value, MemoryPackSerializeOptions.Default);
83+
}
84+
85+
[Benchmark, BenchmarkCategory(Categories.Bytes)]
86+
public byte[] MemoryPackSerializeUtf16()
87+
{
88+
return MemoryPackSerializer.Serialize(value, MemoryPackSerializeOptions.Utf16);
8389
}
8490

8591
// requires T:new(), can't test it.
@@ -113,20 +119,27 @@ public byte[] SystemTextJsonSerialize()
113119
// return orleansSerializer.SerializeToArray(value);
114120
//}
115121

116-
[Benchmark(Baseline = true), BenchmarkCategory(Categories.BufferWriter)]
122+
[Benchmark, BenchmarkCategory(Categories.BufferWriter)]
117123
public void MessagePackBufferWriter()
118124
{
119125
MessagePackSerializer.Serialize(writer, value);
120126
writer.Clear();
121127
}
122128

123-
[Benchmark, BenchmarkCategory(Categories.BufferWriter)]
129+
[Benchmark(Baseline = true), BenchmarkCategory(Categories.BufferWriter)]
124130
public void MemoryPackBufferWriter()
125131
{
126132
MemoryPackSerializer.Serialize(writer, value);
127133
writer.Clear();
128134
}
129135

136+
[Benchmark, BenchmarkCategory(Categories.BufferWriter)]
137+
public void MemoryPackBufferWriterUtf16()
138+
{
139+
MemoryPackSerializer.Serialize(writer, value, MemoryPackSerializeOptions.Utf16);
140+
writer.Clear();
141+
}
142+
130143
//[Benchmark]
131144
//public void BinaryPackStream()
132145
//{
Lines changed: 107 additions & 0 deletions
@@ -0,0 +1,107 @@
1+
using Benchmark.BenchmarkNetUtilities;
2+
using BinaryPack.Models.Helpers;
3+
using MemoryPack;
4+
using System.Net.Http;
5+
6+
namespace Benchmark.Benchmarks;
7+
8+
[PayloadColumn]
9+
public class Utf16VsUtf8
10+
{
11+
readonly string ascii;
12+
readonly string japanese;
13+
readonly string largeAscii;
14+
15+
readonly byte[] utf16Jpn;
16+
readonly byte[] utf8Jpn;
17+
readonly byte[] utf16Ascii;
18+
readonly byte[] utf8Ascii;
19+
readonly byte[] utf16LargeAscii;
20+
readonly byte[] utf8LargeAscii;
21+
22+
public Utf16VsUtf8()
23+
{
24+
this.japanese = "あいうえおかきくけこさしすせそたちつてとなにぬねのはひふへほまみむめもやゆよらりるれろわをん";
25+
this.ascii = "abcedfghijklmnopqrstuvwxyz0123456789";
26+
this.utf16Jpn = MemoryPackSerializer.Serialize(japanese, MemoryPackSerializeOptions.Utf16);
27+
this.utf8Jpn = MemoryPackSerializer.Serialize(japanese, MemoryPackSerializeOptions.Utf8);
28+
this.utf16Ascii = MemoryPackSerializer.Serialize(ascii, MemoryPackSerializeOptions.Utf16);
29+
this.utf8Ascii = MemoryPackSerializer.Serialize(ascii, MemoryPackSerializeOptions.Utf8);
30+
31+
this.largeAscii = RandomProvider.NextString(600);
32+
this.utf16LargeAscii = MemoryPackSerializer.Serialize(largeAscii, MemoryPackSerializeOptions.Utf16);
33+
this.utf8LargeAscii = MemoryPackSerializer.Serialize(largeAscii, MemoryPackSerializeOptions.Utf8);
34+
}
35+
36+
[Benchmark]
37+
public byte[] SerializeUtf16Ascii()
38+
{
39+
return MemoryPackSerializer.Serialize(ascii, MemoryPackSerializeOptions.Utf16);
40+
}
41+
42+
[Benchmark]
43+
public byte[] SerializeUtf16Japanese()
44+
{
45+
return MemoryPackSerializer.Serialize(japanese, MemoryPackSerializeOptions.Utf16);
46+
}
47+
48+
[Benchmark]
49+
public byte[] SerializeUtf8Ascii()
50+
{
51+
return MemoryPackSerializer.Serialize(ascii, MemoryPackSerializeOptions.Utf8);
52+
}
53+
54+
[Benchmark]
55+
public byte[] SerializeUtf8Japanese()
56+
{
57+
return MemoryPackSerializer.Serialize(japanese, MemoryPackSerializeOptions.Utf8);
58+
}
59+
60+
[Benchmark]
61+
public byte[] SerializeUtf16LargeAscii()
62+
{
63+
return MemoryPackSerializer.Serialize(largeAscii, MemoryPackSerializeOptions.Utf16);
64+
}
65+
66+
[Benchmark]
67+
public byte[] SerializeUtf8LargeAscii()
68+
{
69+
return MemoryPackSerializer.Serialize(largeAscii, MemoryPackSerializeOptions.Utf8);
70+
}
71+
72+
[Benchmark]
73+
public void DeserializeUtf16Ascii()
74+
{
75+
MemoryPackSerializer.Deserialize<string>(utf16Ascii);
76+
}
77+
78+
[Benchmark]
79+
public void DeserializeUtf16Japanese()
80+
{
81+
MemoryPackSerializer.Deserialize<string>(utf16Jpn);
82+
}
83+
84+
[Benchmark]
85+
public void DeserializeUtf8Ascii()
86+
{
87+
MemoryPackSerializer.Deserialize<string>(utf8Ascii);
88+
}
89+
90+
[Benchmark]
91+
public void DeserializeUtf8Japanese()
92+
{
93+
MemoryPackSerializer.Deserialize<string>(utf8Jpn);
94+
}
95+
96+
[Benchmark]
97+
public void DeserializeUtf16LargeAscii()
98+
{
99+
MemoryPackSerializer.Deserialize<string>(utf16LargeAscii);
100+
}
101+
102+
[Benchmark]
103+
public void DeserializeUtf8LargeAscii()
104+
{
105+
MemoryPackSerializer.Deserialize<string>(utf8LargeAscii);
106+
}
107+
}

sandbox/Benchmark/Micro/GetLocalVsStaticField.cs

Lines changed: 2 additions & 2 deletions
@@ -24,7 +24,7 @@ public GetLocalVsStaticField()
2424
[Benchmark(Baseline = true)]
2525
public void GetFromProvider()
2626
{
27-
var writer = new MemoryPackWriter<ArrayBufferWriter<byte>>(ref bufferWriter);
27+
var writer = new MemoryPackWriter<ArrayBufferWriter<byte>>(ref bufferWriter, MemoryPackSerializeOptions.Default);
2828
for (int i = 0; i < 100; i++)
2929
{
3030
writer.GetFormatter<int>().Serialize(ref writer, ref i);
@@ -35,7 +35,7 @@ public void GetFromProvider()
3535
[Benchmark]
3636
public void GetFromLocal()
3737
{
38-
var writer = new MemoryPackWriter<ArrayBufferWriter<byte>>(ref bufferWriter);
38+
var writer = new MemoryPackWriter<ArrayBufferWriter<byte>>(ref bufferWriter, MemoryPackSerializeOptions.Default);
3939
var provider = writer.GetFormatter<int>();
4040
for (int i = 0; i < 100; i++)
4141
{

sandbox/Benchmark/Micro/RawSerialize.cs

Lines changed: 5 additions & 5 deletions
@@ -71,7 +71,7 @@ public byte[] HandMemoryPackWriterEmpty()
7171
bufWriter = staticWriter = new ReusableLinkedArrayBufferWriter(true, true);
7272
}
7373

74-
var writer = new MemoryPackWriter<ReusableLinkedArrayBufferWriter>(ref bufWriter, bufWriter.DangerousGetFirstBuffer());
74+
var writer = new MemoryPackWriter<ReusableLinkedArrayBufferWriter>(ref bufWriter, bufWriter.DangerousGetFirstBuffer(), MemoryPackSerializeOptions.Default);
7575
try
7676
{
7777
if (value == null)
@@ -106,7 +106,7 @@ public byte[] HandMemoryPackWriterHeaderOnly()
106106
bufWriter = staticWriter = new ReusableLinkedArrayBufferWriter(true, true);
107107
}
108108

109-
var writer = new MemoryPackWriter<ReusableLinkedArrayBufferWriter>(ref bufWriter, bufWriter.DangerousGetFirstBuffer());
109+
var writer = new MemoryPackWriter<ReusableLinkedArrayBufferWriter>(ref bufWriter, bufWriter.DangerousGetFirstBuffer(), MemoryPackSerializeOptions.Default);
110110
try
111111
{
112112
if (value == null)
@@ -140,7 +140,7 @@ public byte[] HandMemoryPackWriterHeaderInt3()
140140
bufWriter = staticWriter = new ReusableLinkedArrayBufferWriter(true, true);
141141
}
142142

143-
var writer = new MemoryPackWriter<ReusableLinkedArrayBufferWriter>(ref bufWriter, bufWriter.DangerousGetFirstBuffer());
143+
var writer = new MemoryPackWriter<ReusableLinkedArrayBufferWriter>(ref bufWriter, bufWriter.DangerousGetFirstBuffer(), MemoryPackSerializeOptions.Default);
144144
try
145145
{
146146
if (value == null)
@@ -174,7 +174,7 @@ public byte[] HandMemoryPackWriterHeaderInt3String1()
174174
bufWriter = staticWriter = new ReusableLinkedArrayBufferWriter(true, true);
175175
}
176176

177-
var writer = new MemoryPackWriter<ReusableLinkedArrayBufferWriter>(ref bufWriter, bufWriter.DangerousGetFirstBuffer());
177+
var writer = new MemoryPackWriter<ReusableLinkedArrayBufferWriter>(ref bufWriter, bufWriter.DangerousGetFirstBuffer(), MemoryPackSerializeOptions.Default);
178178
try
179179
{
180180
if (value == null)
@@ -208,7 +208,7 @@ public byte[] HandMemoryPackFull()
208208
bufWriter = staticWriter = new ReusableLinkedArrayBufferWriter(true, true);
209209
}
210210

211-
var writer = new MemoryPackWriter<ReusableLinkedArrayBufferWriter>(ref bufWriter, bufWriter.DangerousGetFirstBuffer());
211+
var writer = new MemoryPackWriter<ReusableLinkedArrayBufferWriter>(ref bufWriter, bufWriter.DangerousGetFirstBuffer(), MemoryPackSerializeOptions.Default);
212212
try
213213
{
214214
if (value == null)
Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
1+
using System;
2+
using System.Collections.Generic;
3+
using System.Linq;
4+
using System.Text;
5+
using System.Text.Unicode;
6+
using System.Threading.Tasks;
7+
8+
namespace Benchmark.Micro;
9+
10+
public class Utf8Decoding
11+
{
12+
byte[] utf8bytes;
13+
int utf8length;
14+
int utf16length;
15+
16+
public Utf8Decoding()
17+
{
18+
// Japanese Hiragana
19+
var text = "あいうえおかきくけこさしすせそたちつてとなにぬねのはひふへほまみむめもやゆよらりるれろわをん";
20+
utf8bytes = Encoding.UTF8.GetBytes(text);
21+
utf8length = utf8bytes.Length;
22+
utf16length = text.Length;
23+
}
24+
25+
[Benchmark]
26+
public string UTF8GetString()
27+
{
28+
return Encoding.UTF8.GetString(utf8bytes);
29+
}
30+
31+
[Benchmark]
32+
public string Utf16LengthUtf8ToUtf16()
33+
{
34+
return string.Create(utf16length, utf8bytes, static (dest, source) =>
35+
{
36+
Utf8.ToUtf16(source, dest, out var read, out var written);
37+
});
38+
}
39+
}
