Commit message (Collapse) | Author | Age | Files | Lines | |
---|---|---|---|---|---|
* | PR #1662: Replace shift with addition in crc multiply | Pavel P | 2024-05-07 | 1 | -7/+5 |
| | | | | | | | | | | | | Imported from GitHub PR https://github.com/abseil/abseil-cpp/pull/1662 Merge 4b2c6c909b573d31a1cccba7cb72d4d8badeef8b into cba31a956209e68e4d4049e8a9bc03b1fd67320a Merging this change closes #1662 COPYBARA_INTEGRATE_REVIEW=https://github.com/abseil/abseil-cpp/pull/1662 from pps83:crc-add 4b2c6c909b573d31a1cccba7cb72d4d8badeef8b PiperOrigin-RevId: 631470883 Change-Id: I4a72be643ed341ddf0e0007418ab4a613a03db4b | ||||
* | Optimize crc32 V128_From2x64 on Arm | Connal de Souza | 2024-04-04 | 1 | -7/+10 |
| | | | | | | | This removes redundant vector-vector moves and results in Extend being up to 3% faster. PiperOrigin-RevId: 621948170 Change-Id: Id82816aa6e294d34140ff591103cb20feac79d9a | ||||
* | PR #1617: fix MSVC 32-bit build with -arch:AVX | Stanislaw Halik | 2024-02-15 | 1 | -3/+4 |
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Imported from GitHub PR https://github.com/abseil/abseil-cpp/pull/1617 The intrinsics used aren't available on `x86_64` processors while running in 32-bit mode. See: - list of 64-bit intrinsics (https://learn.microsoft.com/en-us/cpp/intrinsics/x64-amd64-intrinsics-list?view=msvc-170) - list of 32-bit intrinsics (https://learn.microsoft.com/en-us/cpp/intrinsics/x86-intrinsics-list?view=msvc-170) - list of predefined MSVC macros (https://learn.microsoft.com/en-us/cpp/preprocessor/predefined-macros?view=msvc-170) The error message in question: ```console F:\dev\opentrack-depends\onnxruntime-build\msvc\_deps\abseil_cpp-src\absl/crc/internal/crc32_x86_arm_combined_simd.h(145,32): error C3861: '_mm_crc32_u64': identifier not found return static_cast<uint32_t>(_mm_crc32_u64(crc, v)); ^ F:\dev\opentrack-depends\onnxruntime-build\msvc\_deps\abseil_cpp-src\absl/crc/internal/crc32_x86_arm_combined_simd.h(193,50): error C3861: '_mm_cvtsi128_si64': identifier not found inline int64_t V128_Low64(const V128 l) { return _mm_cvtsi128_si64(l); } ``` Merge 06f5832108a2b01e0a900db51e1c870f7069a1f2 into 797501d12ea767dabdc8d36674e083869e62ee7d Merging this change closes #1617 COPYBARA_INTEGRATE_REVIEW=https://github.com/abseil/abseil-cpp/pull/1617 from sthalik:pr/fix-msvc-32-bit-avx 06f5832108a2b01e0a900db51e1c870f7069a1f2 PiperOrigin-RevId: 607483370 Change-Id: Id2a6f6dd33c2707fe7ffe134e7335916f3fb9da3 | ||||
* | Avoid using the non-portable type __m128i_u. | Derek Mauro | 2023-10-26 | 1 | -5/+5 |
| | | | | | | | | | | | | | | | | According to https://stackoverflow.com/a/68939636 it is safe to use __m128i instead. https://learn.microsoft.com/en-us/cpp/intrinsics/x86-intrinsics-list?view=msvc-170 also uses this type instead __m128i_u is just __m128i with a looser alignment requirement, but simply calling _mm_loadu_si128() instead of _mm_load_si128() is enough to tell the compiler when a pointer is unaligned. Fixes #1552 PiperOrigin-RevId: 576931936 Change-Id: I7c3530001149b360c12a1786c7e1832754d0e35c | ||||
* | Optimize CRC32 Extend for large inputs on Arm | Connal de Souza | 2023-09-21 | 1 | -5/+9 |
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is a temporary workaround for an apparent compiler bug with pmull(2) instructions. The current hot loop looks like this: mov w14, #0xef02, lsl x15, x15, #6, mov x13, xzr, movk w14, #0x740e, lsl #16, sub x15, x15, #0x40, ldr q4, [x16, #0x4e0], _LOOP_START: add x16, x9, x13, add x17, x12, x13, fmov d19, x14, <--------- This is Loop invariant and expensive add x13, x13, #0x40, cmp x15, x13, prfm pldl1keep, [x16, #0x140], prfm pldl1keep, [x17, #0x140], ldp x18, x0, [x16, #0x40], crc32cx w10, w10, x18, ldp x2, x18, [x16, #0x50], crc32cx w10, w10, x0, crc32cx w10, w10, x2, ldp x0, x2, [x16, #0x60], crc32cx w10, w10, x18, ldp x18, x16, [x16, #0x70], pmull2 v5.1q, v1.2d, v4.2d, pmull2 v6.1q, v0.2d, v4.2d, pmull2 v7.1q, v2.2d, v4.2d, pmull2 v16.1q, v3.2d, v4.2d, ldp q17, q18, [x17, #0x40], crc32cx w10, w10, x0, pmull v1.1q, v1.1d, v19.1d, crc32cx w10, w10, x2, pmull v0.1q, v0.1d, v19.1d, crc32cx w10, w10, x18, pmull v2.1q, v2.1d, v19.1d, crc32cx w10, w10, x16, pmull v3.1q, v3.1d, v19.1d, ldp q20, q21, [x17, #0x60], eor v1.16b, v17.16b, v1.16b, eor v0.16b, v18.16b, v0.16b, eor v1.16b, v1.16b, v5.16b, eor v2.16b, v20.16b, v2.16b, eor v0.16b, v0.16b, v6.16b, eor v3.16b, v21.16b, v3.16b, eor v2.16b, v2.16b, v7.16b, eor v3.16b, v3.16b, v16.16b, b.ne _LOOP_START There is a redundant fmov that moves the same constant into a Neon register every loop iteration to be used in the PMULL instructions. The PMULL2 instructions already have this constant loaded into Neon registers. After this change, both the PMULL and PMULL2 instructions use the values in q4, and they are not reloaded every iteration. This fmov was expensive because it contends for execution units with crc32cx instructions. This is up to 20% faster for large inputs. PiperOrigin-RevId: 567391972 Change-Id: I4c8e49750cfa5cc5730c3bb713bd9fd67657804a | ||||
* | Remove implicit int64_t->uint64_t conversion in ARM version of V128_Extract64 | Abseil Team | 2023-09-15 | 1 | -1/+1 |
| | | | | | PiperOrigin-RevId: 565662176 Change-Id: I18d5d9eb444b0090e3f4ab8f66ad214a67344268 | ||||
* | Roll forward support for ARM intrinsics in crc_memcpy | Abseil Team | 2023-09-07 | 1 | -3/+28 |
| | | | | | | | | | | | | | | | | | | | This CL rolls forward a previous change which we rolled back temporarily due to compilation errors on x86 when PCLMUL intrinsics were unavailable. *** Original change description *** This change replaces inline x86 intrinsics with generic versions that compile for both x86 and ARM depending on the target arch. This change does not enable the accelerated crc memcpy engine on ARM. That will be done in a subsequent change after the optimal number of vector and integer regions for different CPUs is determined. *** PiperOrigin-RevId: 563416413 Change-Id: Iee630a15ed83c26659adb0e8a03d3f3d3a46d688 | ||||
* | Rollback adding support for ARM intrinsics | Abseil Team | 2023-09-05 | 1 | -28/+3 |
| | | | | | | | | In some configurations this change causes compilation errors. We will roll this forward again after those issue are addressed. PiperOrigin-RevId: 562810916 Change-Id: I45b2a8d456273e9eff188f36da8f11323c4dfe66 | ||||
* | Add support for ARM intrinsics in crc_memcpy | Abseil Team | 2023-09-05 | 1 | -3/+28 |
| | | | | | | | | | | | | This change replaces inline x86 intrinsics with generic versions that compile for both x86 and ARM depending on the target arch. This change does not enable the accelerated crc memcpy engine on ARM. That will be done in a subsequent change after the optimal number of vector and integer regions for different CPUs is determined. PiperOrigin-RevId: 562785420 Change-Id: I8ba4aa8de17587cedd92532f03767059a481f159 | ||||
* | Don't assume that AVX implies PCLMULQDQ when using LLVM on Windows. | Saran Tunyasuvunakool | 2023-02-07 | 1 | -1/+1 |
| | | | | | PiperOrigin-RevId: 507790741 Change-Id: I347357f9a2d698510f29b7d1b065ef73f9289292 | ||||
* | Don't use Arm vector intrinsics when compiling with CUDA in device mode. | Abseil Team | 2023-01-11 | 1 | -1/+1 |
| | | | | | PiperOrigin-RevId: 501464530 Change-Id: I5a0929a2b88c1c158b1696634a65ffda9c4b8590 | ||||
* | Require 64-bit builds on x86 to use CRC32 hardware acceleration | Derek Mauro | 2023-01-04 | 1 | -2/+4 |
| | | | | | | | | 32-bit builds with SSE 4.2 do exist, and these builds do not work without this patch. PiperOrigin-RevId: 499498979 Change-Id: I0ade09068804655652c07d0f1ef13554464a1558 | ||||
* | Allow Cord to store chunked checksums | Derek Mauro | 2022-12-11 | 1 | -2/+3 |
| | | | | | PiperOrigin-RevId: 494587777 Change-Id: I41504edca6fcf750d52602fa84a33bc7fe5fbb48 | ||||
* | Fixes many compilation issues that come from having no external CI | Derek Mauro | 2022-11-30 | 1 | -2/+2 |
| | | | | | | | | | | | | | | | | | | coverage of the accelerated CRC implementation and some differences bewteen the internal and external implementation. This change adds CI coverage to the linux_clang-latest_libstdcxx_bazel.sh script assuming this script always runs on machines of at least the Intel Haswell generation. Fixes include: * Remove the use of the deprecated xor operator on crc32c_t * Remove #pragma unroll_completely, which isn't known by GCC or Clang: https://godbolt.org/z/97j4vbacs * Fixes for -Wsign-compare, -Wsign-conversion and -Wshorten-64-to-32 PiperOrigin-RevId: 491965029 Change-Id: Ic5e1f3a20f69fcd35fe81ebef63443ad26bf7931 | ||||
* | CRC: Get CPU detection and hardware acceleration working on MSVC x86(_64) | Derek Mauro | 2022-11-23 | 1 | -1/+7 |
| | | | | | | | Using /arch:AVX on MSVC now uses the accelerated implementation PiperOrigin-RevId: 490550573 Change-Id: I924259845f38ee41d15f23f95ad085ad664642b5 | ||||
* | Release the CRC library | Derek Mauro | 2022-11-09 | 1 | -0/+260 |
This implementation can advantage of hardware acceleration available on common CPUs when using GCC and Clang. A future update may enable this on MSVC as well. PiperOrigin-RevId: 487327024 Change-Id: I99a8f1bcbdf25297e776537e23bd0a902e0818a1 |